Saving a DataFrame to multiple files, with the sequence number and line count in the file names

Question

There is a DataFrame, for example with the following content:

The task is to split the DataFrame into files, each containing a specified number of lines (set in a variable), plus a remainder file if the total is not an exact multiple of that number. Each file name must also include the file's sequence number in save order.

In practice, the file has about 10,000 to 50,000 lines, which may matter. If there are ways to optimize this, I would be glad to see them (for example, if the df contained more than 10,000,000 lines and the files had to be written differently to save resources).

For the DataFrame from the example (5 lines in df), the following files should be produced:

sample_1_2.csv
sample_2_2.csv
sample_3_1.csv

Accepted Answer · 2018-07-24T08:17:08

Example:

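    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(10, size=(33, 3)), columns=list('abc'))
    n = 10  # maximum number of rows per file

    # x is the 0-based chunk number of each row: 0 for rows 0-9, 1 for rows 10-19, ...
    for x, g in df.assign(x=np.arange(len(df)) // n).groupby('x'):
        # drop the helper column, then save as file_<chunk number>_<row count>.csv
        g.drop(columns='x').to_csv(
            'd:/temp/file_{:03d}_{}.csv'.format(x, len(g)), index=False)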

list of files:

file_000_10.csv
file_001_10.csv
file_002_10.csv
file_003_3.csv
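
For very large frames, a variation that may use fewer resources is to slice the DataFrame by position rather than adding a helper column and grouping. The following is only a sketch under assumptions: the function name split_to_csv, the prefix parameter, and the temporary output directory are made up for illustration, and the 1-based numbering follows the sample_1_2.csv naming from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def split_to_csv(df, n, out_dir, prefix="sample"):
    """Write df to CSV files of at most n rows each and return the file names.

    Hypothetical helper: names follow <prefix>_<sequence number>_<row count>.csv.
    """
    names = []
    # start takes the values 0, n, 2n, ...; seq counts the files from 1
    for seq, start in enumerate(range(0, len(df), n), start=1):
        chunk = df.iloc[start:start + n]  # the last chunk may be shorter than n
        name = f"{prefix}_{seq}_{len(chunk)}.csv"
        chunk.to_csv(os.path.join(out_dir, name), index=False)
        names.append(name)
    return names


# 5-row frame, 2 rows per file, as in the question's example
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=list("abc"))
out_dir = tempfile.mkdtemp()  # assumed scratch directory for the demo
names = split_to_csv(df, 2, out_dir)
print(names)  # ['sample_1_2.csv', 'sample_2_2.csv', 'sample_3_1.csv']
```

Each iteration only touches one positional slice, so no extra column is attached to the whole frame. Whether this is noticeably faster at 10,000,000 lines would need to be measured on the real data.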

The output gave me three files with a sequence number, but the sequence number in the name should be followed by the total number of lines in each file.
I also understand that things get more complicated if we extend @MaxU's solution to split the file using only a selection of columns.
For example, if I do the same thing with df['a'], an error pops up: AttributeError: 'Series' object has no attribute 'assign'.
@APmansib, use df[['a']]: this returns a DataFrame with a single column
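
To illustrate the difference mentioned in the last comment, here is a small sketch on a made-up 5-row frame (the data and the chunk size n = 2 are assumptions for the demo). Single brackets give a Series, which has no assign method, while double brackets give a one-column DataFrame that works with the same chunking expression as in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=list("abc"))

series = df["a"]    # single brackets: a Series
frame = df[["a"]]   # double brackets: a DataFrame with one column

print(type(series).__name__, hasattr(series, "assign"))  # Series False
print(type(frame).__name__, hasattr(frame, "assign"))    # DataFrame True

# The one-column DataFrame can be chunked exactly like the full frame
n = 2
sizes = frame.assign(x=np.arange(len(frame)) // n).groupby("x").size().tolist()
print(sizes)  # [2, 2, 1]
```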
