There is a DataFrame, for example with the following content:

DataFrame Example

The task is to split the DataFrame into files, in which there will be a specified number of lines (set in a variable) and a file residue, if there is no multiple content left. Also, each file must be numbered with a save sequence number.

In life, the file has about 10 000 - 50 000 lines. Perhaps this is important. If there are variations on optimization, I will be glad to see them (for example, this df will contain more than 10,000,000 lines and it will be necessary to hit files differently in order to optimize resources.

According to the DataFame from the example (in df 5 lines) the following files should turn out:

  • sample_1_2.csv
  • sample_2_2.csv
  • sample_3_1.csv

    1 answer 1

    Example:

    df = pd.DataFrame(np.random.randint(10, size=(33, 3)), columns=list('abc')) n = 10 (df.assign(x=np.arange(len(df)) // n) .groupby('x') .apply(lambda g: g.drop('x', 1) .to_csv('d:/temp/file_{:03d}_{}.csv' .format(g['x'].values[0], len(g)), index=False))) 

    list of files:

     file_000_10.csv file_001_10.csv file_002_10.csv file_003_3.csv 
    • Good day @ MaxU. Thank you - it really works almost right. At the output I received three files with a sequence number, but the sequence number in the name should be followed by the value of the total number of lines in each file. :-) - APmansib
    • I also understand that everything will become more complicated if we add to the solution of @MaxU the desire to process the file (split) by applying a selection of columns. For example, I want to do everything the same, but with df ['a'] AttributeError will pop up: 'Series' object has no attribute 'assign'. How to be in this case? - APmansib
    • @APmansib, use df[['a']] - this will return a single column DataFrame - MaxU