There is a CSV file containing 8 columns and a huge number of lines, only numeric data. What is the easiest way to double the value of each cell?
- The easiest way: read, multiply, write. At which stage is the problem? - Vladimir Martyanov
- At the multiplication stage. I know how to read and edit a file. - irvis
- print 5 * 7 - what's the problem with multiplication? - Vladimir Martyanov
- I've read how to multiply rows, columns and so on, but I did not find how to multiply all the elements of a file by a number. I do not program in Python; I only need part of its tooling for my tasks. I want to change the values to the multiplied ones and overwrite the file. - irvis
- I have the code, I just need to add two or three lines that do the multiplication. I can write them myself; the question is different: how do I do this optimally - multiply one line at a time, or some other way? A pointer to any resource where such functionality is shown would be enough. - irvis
2 answers
Using Pandas this is extremely simple and much faster than loop-based solutions:
import pandas as pd

pd.read_csv('/path/to/file.csv').mul(2).to_csv('/path/to/result.csv', index=False)
Or, if your CSV does not have column names:
(pd.read_csv('/path/to/file.csv', header=None)
   .mul(2)
   .to_csv('/path/to/result.csv', header=None, index=False)
)
If all the data does not fit in memory, Pandas allows you to process it in chunks:
for chunk in pd.read_csv('/path/to/file.csv', header=None, chunksize=10**6):
    chunk.mul(2).to_csv('/path/to/result.csv', header=None, index=False, mode='a')
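One caveat with the chunked variant: mode='a' always appends, so a result file left over from a previous run would be appended to rather than replaced. A minimal runnable sketch of the same idea (with assumed file names, a tiny generated sample, and the stale output removed first):

```python
import os
import pandas as pd

src, dst = 'file.csv', 'result.csv'  # assumed file names for this sketch

# create a tiny sample input so the sketch runs end-to-end
pd.DataFrame([[1, 2], [3, 4], [5, 6]]).to_csv(src, header=False, index=False)

# mode='a' appends, so remove any output left over from a previous run
if os.path.exists(dst):
    os.remove(dst)

# process the file in small chunks, doubling each cell
for chunk in pd.read_csv(src, header=None, chunksize=2):
    chunk.mul(2).to_csv(dst, header=False, index=False, mode='a')
```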
Test
Generate test data:
In [80]: x = pd.DataFrame(np.random.rand(10**6, 8))

In [81]: x.shape
Out[81]: (1000000, 8)

In [82]: x.head()
Out[82]:
          0         1         2         3         4         5         6         7
0  0.890351  0.814333  0.436384  0.477630  0.394325  0.339082  0.205078  0.033898
1  0.749606  0.045757  0.850344  0.496043  0.727086  0.717089  0.086069  0.826920
2  0.703955  0.492983  0.105174  0.298543  0.008731  0.885632  0.233774  0.046960
3  0.053236  0.024155  0.126744  0.139082  0.475896  0.747354  0.772720  0.816470
4  0.439865  0.344010  0.301929  0.528090  0.462750  0.113415  0.621818  0.796230

In [83]: x.to_csv('c:/temp/file.csv', index=False, header=None)
Size of C:\Temp\file.csv: 148 MiB
Measuring the speed on my regular (non-SSD) drive:
In [84]: %%timeit
   ....: (pd.read_csv('c:/temp/file.csv', header=None)
   ....:    .mul(2)
   ....:    .to_csv('c:/temp/result.csv', header=None, index=False)
   ....: )
   ....:
1 loop, best of 3: 30.4 s per loop
Result: 30 seconds to read the original CSV, multiply all the elements and write the result back to CSV, for 1 million rows and 8 columns.
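For comparison, the same transformation can be sketched with plain NumPy (assumed file names; loadtxt/savetxt are typically slower than pandas' C parser on large files, but avoid the pandas dependency):

```python
import numpy as np

# write a small sample input so the sketch runs end-to-end (assumed file names)
np.savetxt('file.csv', np.arange(8.0).reshape(2, 4), delimiter=',')

# load the whole numeric table, double it, and write it back out
data = np.loadtxt('file.csv', delimiter=',')
np.savetxt('result.csv', data * 2, delimiter=',', fmt='%g')
```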
- Thank you, that's exactly what was required. - irvis
- @irvis if the number of lines is "huge" (such that all the numbers do not fit in memory), then the file has to be processed in parts (for example, line by line). If the task is limited by disk performance (IO bound), then it does not matter much how you multiply those lines (you could even read one number at a time). On the whole, pandas is convenient for working with numeric tables. Measure the performance if it matters in your case. - jfs
- @jfs, thanks for the comment! I added code to handle this case in Pandas ... - MaxU
If you need to perform the task using only the standard library (otherwise pandas, as @MaxU demonstrated, is ideally suited for this task):
#!/usr/bin/env python3
import fileinput
import sys

with fileinput.FileInput(inplace=True, backup='.bak') as file:
    for line in file:
        try:
            # multiply whitespace-separated float numbers
            print(*[2 * float(f) for f in line.split()])
        except ValueError as e:
            print("Can't parse %r line, error: %s" % (line, e), file=sys.stderr)
This code modifies the files specified on the command line, or stdin if none are given. Each source file is first copied to a *.bak file (a backup in case of an error), and stdout is redirected to the source file so that the changes are made in place.
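Note that the snippet above splits on whitespace, while the file in the question is comma-separated. A stdlib variant that keeps the comma-separated format could use the csv module (a minimal sketch with assumed file names and a tiny generated sample, writing to a separate output file rather than in place):

```python
import csv

# assumed file names; write a tiny sample input so the sketch runs end-to-end
with open('file.csv', 'w', newline='') as f:
    f.write('1,2\n3,4\n')

with open('file.csv', newline='') as src, \
     open('result.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # double every cell; values are written back as floats, e.g. '2.0'
        writer.writerow([2 * float(cell) for cell in row])
```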
- So I tried, but, frankly, did not have enough knowledge to complete. - irvis