I have a CSV data file (7 columns and 6063 rows). The column names are something like ['id', 'seller', 'buyer', 'timestamp'], with the corresponding data in the rows. I need to remove from this file the rows where seller equals buyer.

 import pandas as pd

 data = pd.read_csv('file.csv', sep=';', decimal=',')
 dat = pd.DataFrame(data.T)
 for i in dat:
     if dat[dat.columns[i]][1] == dat[dat.columns[i]][2]:
         a = dat.columns[i]

I get something like this, but then I run into a problem removing the columns (after transposing, the original rows are now columns): the ones to delete are not consecutive, and I don't really want to list 1450 column names by hand. Is there a better way to do this?

    1 answer

    Use the .query() method:

     data = pd.read_csv('file.csv', sep=';', decimal=',', quotechar="'").query('seller != buyer') 

    If you need to save back to CSV:

     data.to_csv('output.csv', index=False) 

    P.S. You do not need to transpose the DataFrame in order to filter it.

    P.P.S. When using pandas, try to avoid for loops over the data; vectorized operations are usually much faster.
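As a rough sketch of what "no transpose, no loop" can look like, the same filter can also be written as a boolean mask instead of .query(). The inline sample data and the name `csv_text` are assumptions made up for illustration; they stand in for file.csv.

```python
import io

import pandas as pd

# Hypothetical sample standing in for file.csv (same ';' separator).
csv_text = """id;seller;buyer;timestamp
1;seller-1;buyer-1;2016-01-01
2;seller-2;buyer-2;2016-01-02
3;same-1;same-1;2016-01-11
4;same-2;same-2;2016-01-22
"""

data = pd.read_csv(io.StringIO(csv_text), sep=';')

# Compare the two columns element-wise; True marks rows to keep.
mask = data['seller'] != data['buyer']

# Index with the mask to keep only the rows where seller and buyer differ.
filtered = data[mask]

print(filtered)
```

The mask form may be handy when the column names are not valid Python identifiers, since it does not rely on .query() parsing them.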

    Here is a working example that takes into account the fact that your CSV uses ' as the quote character:

    CSV file - D:\temp\buyer_seller.csv:

     'id';'seller';'buyer';'timestamp'
     1;seller-1;buyer-1;2016-01-01
     2;seller-2;buyer-2;2016-01-02
     3;same-1;same-1;2016-01-11
     4;same-2;same-2;2016-01-22

    Code:

     In [21]: pd.read_csv(r'D:\temp\buyer_seller.csv', sep=';')
     Out[21]:
        'id'  'seller'  'buyer' 'timestamp'
     0     1  seller-1  buyer-1  2016-01-01
     1     2  seller-2  buyer-2  2016-01-02
     2     3    same-1   same-1  2016-01-11
     3     4    same-2   same-2  2016-01-22

     In [22]: pd.read_csv(r'D:\temp\buyer_seller.csv', sep=';', quotechar="'")
     Out[22]:
        id    seller    buyer   timestamp
     0   1  seller-1  buyer-1  2016-01-01
     1   2  seller-2  buyer-2  2016-01-02
     2   3    same-1   same-1  2016-01-11
     3   4    same-2   same-2  2016-01-22

     In [23]: pd.read_csv(r'D:\temp\buyer_seller.csv', sep=';', quotechar="'").query('seller != buyer')
     Out[23]:
        id    seller    buyer   timestamp
     0   1  seller-1  buyer-1  2016-01-01
     1   2  seller-2  buyer-2  2016-01-02

    Alternatively, you can simply strip the quotes from the column names:

     In [27]: df = pd.read_csv(r'D:\temp\buyer_seller.csv', sep=';')

     In [28]: df.columns.tolist()
     Out[28]: ["'id'", "'seller'", "'buyer'", "'timestamp'"]

     In [30]: df.columns = df.columns.str.replace("'", '')

     In [31]: df.columns.tolist()
     Out[31]: ['id', 'seller', 'buyer', 'timestamp']
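Putting the quote-stripping step together with the filter, a minimal self-contained sketch might look like this. The inline `csv_text` is a made-up stand-in for D:\temp\buyer_seller.csv, and `regex=False` is an assumption added here so that the quote is treated as a literal character.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file, with single-quoted header names.
csv_text = """'id';'seller';'buyer';'timestamp'
1;seller-1;buyer-1;2016-01-01
2;seller-2;buyer-2;2016-01-02
3;same-1;same-1;2016-01-11
4;same-2;same-2;2016-01-22
"""

# Read without quotechar="'", so the header names keep their quotes.
df = pd.read_csv(io.StringIO(csv_text), sep=';')

# Strip the single quotes from every column name.
df.columns = df.columns.str.replace("'", "", regex=False)

# With clean names, .query() can refer to the columns directly.
result = df.query('seller != buyer')

print(result)
```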
    • This would be a great idea, except that the column names in the file are actually written as 'seller' and 'buyer' (with the quotes), so they are not recognized here. I get a long list of errors ending with UndefinedVariableError: name 'seller' is not defined - Katia Nahornaya
    • My solution assumes that the columns are actually named buyer and seller. What does print(df.columns.tolist()) show after reading the CSV? - MaxU
    • @KatiaNahornaya, I updated the example in the answer ... - MaxU
    • That is very cool! Thanks! - Katia Nahornaya
    • May I ask one more question, about the timestamp column? The values in this column are numbers like 1469502678. How can they be converted to a real date/time? - Katia Nahornaya
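On the timestamp question in the last comment, one possible approach, assuming the numbers are Unix epoch seconds (the thread does not confirm this), is pd.to_datetime with unit='s'. The one-row DataFrame and the `datetime` column name below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame holding the value quoted in the comment.
df = pd.DataFrame({'timestamp': [1469502678]})

# Assumption: the values are seconds since 1970-01-01 00:00:00 UTC.
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')

print(df)
```

If the values turned out to be milliseconds instead, unit='ms' would be the corresponding setting.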