There is a df with thousands of object format columns that need to be formatted before converting to float . The problem is that sometimes there are not standard values ​​that can be trimmed with str.replace , but such rare garbage values ​​meaning completely different things than the essence of the column.

For example:

 r_data_executions_blocks_0_items_118_sum NaN 10165 10 000 руб. 4 781 760 руб. 1 40 922 руб. 1 200 руб. 1 201 844 руб. 1 177 579 руб. 1 34198/15/50006-ИП 1 415 руб. 1 21148/18/86014-ИП 1 1,3 млн руб. 1 176 427 руб. 1 Name: r_data_executions_blocks_0_items_118_sum, dtype: int64 

It is seen that the amount of type 10 000 руб. can be cleared of currency cuts and order of magnitude, to remove the space between digits. Values ​​of the same type 34198/15/50006-ИП should be replaced entirely with “NaN” using the ИП pattern.

Type Code

 for n in df_common_fin.columns: if 'sum' in n: df_common_fin[n] = np.where(df_common_fin[n].str.contains('ИП', regex=False), 'NaN', df_common_fin[n]) df_common_fin[n] = df_common_fin[n].str.contains('ИП', regex=False) df_common_fin[n] = df_common_fin[n].str.replace('руб', '') df_common_fin[n] = df_common_fin[n].str.replace('млн', '00000') df_common_fin[n] = df_common_fin[n].str.replace('млрд', '00000000') df_common_fin[n] = df_common_fin[n].str.replace('Исполнительный лист', "NaN") df_common_fin[n] = df_common_fin[n].str.replace('Сумма неизвестна', "NaN") df_common_fin[n].fillna('NaN', inplace=True) df_common_fin[n] = df_common_fin[n].str.replace('Нет', 'NaN') df_common_fin[n] = df_common_fin[n].str.replace('.', '') df_common_fin[n] = df_common_fin[n].str.replace(',', '') df_common_fin[n] = df_common_fin[n].str.replace(' ', '') df_common_fin[n] = df_common_fin[n].astype(np.float64) 

Does not work.

    1 answer 1

    In this case, I would process (replace) only the lines of interest (with sums) and at the very end use:

     df[col_name] = pd.to_numeric(df[col_name], errors='coerce') 

    errors='coerce' turns all strings that could not be converted to numeric values ​​into NaN . Those. the pd.to_numeric() function takes care of all garbage.


     def clean_amount(col): to_repl = ["\s*", "\*", "руб\.?", ",", "млн\.?", "млрд\.?"] val = ["", "", "", ".", "*10**6", "*10**9"] col = col.replace(to_repl, val, regex=True) mask = pd.to_numeric(c.str.replace('*','',regex=False), errors='coerce').notna() col.loc[mask] = pd.eval(col.loc[mask]) return pd.to_numeric(col, errors='coerce') 

    Work example:

     In [327]: df Out[327]: col 0 NaN 1 10 000 руб. 2 781 760 руб. 3 40 922 руб. 4 200 руб. 5 201 844 руб. 6 177 579 руб. 7 34198/15/50006-ИП 8 415 руб. 9 21148/18/86014-ИП 10 1,3 млн руб. 11 176 427 руб. In [328]: df['col_clean'] = clean_amount(df['col']) In [329]: df Out[329]: col col_clean 0 NaN NaN 1 10 000 руб. 10000.0 2 781 760 руб. 781760.0 3 40 922 руб. 40922.0 4 200 руб. 200.0 5 201 844 руб. 201844.0 6 177 579 руб. 177579.0 7 34198/15/50006-ИП NaN 8 415 руб. 415.0 9 21148/18/86014-ИП NaN 10 1,3 млн руб. 1300000.0 11 176 427 руб. 176427.0 
    • An error occurs: AttributeError: Can only use .str accessor with string values, which use np.object_dtype in pandas with this part of the signs is converted, part is not. What you need to check? - Stepan Sokol
    • one
      @StepanSokol, supplemented the answer with the implementation - MaxU
    • one
      It looks cool! Thank! - Stepan Sokol
    • one
      After your comments, I tried to implement what I myself advised in the answer - in practice, the implementation turned out to be a bit more complicated than in theory (as is often the case - it's all about the little things) ;-) - MaxU
    • one
      Without trying, do not answer your question ... If all cases are taken into account, then you can use. PS it is still lucky that Pandas can process texts in batches (in vectorized form) :) - MaxU