Convert string values of the type (1.3 million rubles.) In the numeric type, and those that can not be converted to numbers replaced by NaN

Question

There is a df with thousands of object format columns that need to be formatted before converting to float . The problem is that sometimes there are not standard values that can be trimmed with str.replace , but such rare garbage values meaning completely different things than the essence of the column.

For example:

 r_data_executions_blocks_0_items_118_sum NaN 10165 10 000 руб. 4 781 760 руб. 1 40 922 руб. 1 200 руб. 1 201 844 руб. 1 177 579 руб. 1 34198/15/50006-ИП 1 415 руб. 1 21148/18/86014-ИП 1 1,3 млн руб. 1 176 427 руб. 1 Name: r_data_executions_blocks_0_items_118_sum, dtype: int64

It is seen that the amount of type 10 000 руб. can be cleared of currency cuts and order of magnitude, to remove the space between digits. Values of the same type 34198/15/50006-ИП should be replaced entirely with “NaN” using the ИП pattern.

Type Code

 for n in df_common_fin.columns: if 'sum' in n: df_common_fin[n] = np.where(df_common_fin[n].str.contains('ИП', regex=False), 'NaN', df_common_fin[n]) df_common_fin[n] = df_common_fin[n].str.contains('ИП', regex=False) df_common_fin[n] = df_common_fin[n].str.replace('руб', '') df_common_fin[n] = df_common_fin[n].str.replace('млн', '00000') df_common_fin[n] = df_common_fin[n].str.replace('млрд', '00000000') df_common_fin[n] = df_common_fin[n].str.replace('Исполнительный лист', "NaN") df_common_fin[n] = df_common_fin[n].str.replace('Сумма неизвестна', "NaN") df_common_fin[n].fillna('NaN', inplace=True) df_common_fin[n] = df_common_fin[n].str.replace('Нет', 'NaN') df_common_fin[n] = df_common_fin[n].str.replace('.', '') df_common_fin[n] = df_common_fin[n].str.replace(',', '') df_common_fin[n] = df_common_fin[n].str.replace(' ', '') df_common_fin[n] = df_common_fin[n].astype(np.float64)

Does not work.

Accepted Answer · 2018-11-13T12:24:23

In this case, I would process (replace) only the lines of interest (with sums) and at the very end use:

 df[col_name] = pd.to_numeric(df[col_name], errors='coerce')

errors='coerce' turns all strings that could not be converted to numeric values into NaN . Those. the pd.to_numeric() function takes care of all garbage.

 def clean_amount(col): to_repl = ["\s*", "\*", "руб\.?", ",", "млн\.?", "млрд\.?"] val = ["", "", "", ".", "*10**6", "*10**9"] col = col.replace(to_repl, val, regex=True) mask = pd.to_numeric(c.str.replace('*','',regex=False), errors='coerce').notna() col.loc[mask] = pd.eval(col.loc[mask]) return pd.to_numeric(col, errors='coerce')

Work example:

 In [327]: df Out[327]: col 0 NaN 1 10 000 руб. 2 781 760 руб. 3 40 922 руб. 4 200 руб. 5 201 844 руб. 6 177 579 руб. 7 34198/15/50006-ИП 8 415 руб. 9 21148/18/86014-ИП 10 1,3 млн руб. 11 176 427 руб. In [328]: df['col_clean'] = clean_amount(df['col']) In [329]: df Out[329]: col col_clean 0 NaN NaN 1 10 000 руб. 10000.0 2 781 760 руб. 781760.0 3 40 922 руб. 40922.0 4 200 руб. 200.0 5 201 844 руб. 201844.0 6 177 579 руб. 177579.0 7 34198/15/50006-ИП NaN 8 415 руб. 415.0 9 21148/18/86014-ИП NaN 10 1,3 млн руб. 1300000.0 11 176 427 руб. 176427.0

An error occurs: AttributeError: Can only use .str accessor with string values, which use np.object_dtype in pandas with this part of the signs is converted, part is not.
@StepanSokol, supplemented the answer with the implementation
After your comments, I tried to implement what I myself advised in the answer - in practice, the implementation turned out to be a bit more complicated than in theory (as is often the case - it's all about the little things) ;-)
Without trying, do not answer your question ... If all cases are taken into account, then you can use.
PS it is still lucky that Pandas can process texts in batches (in vectorized form) :)

Convert string values of the type (1.3 million rubles.) In the numeric type, and those that can not be converted to numbers replaced by NaN

1 answer 1

More articles:

Convert string values ​​of the type (1.3 million rubles.) In the numeric type, and those that can not be converted to numbers replaced by NaN

1 answer 1

More articles:

Convert string values of the type (1.3 million rubles.) In the numeric type, and those that can not be converted to numbers replaced by NaN