Categorizing values in multiple columns of a DataFrame

Question

Given: a DataFrame with about 700 columns, many of which are object-typed categorical variables with similar values. Goal: rename the values in every observation so that only 4-5 distinct categories remain. The matching has to be done on fragments of phrases.

Code:

    for n in df_court.columns:  # for each column of the DataFrame
        # if the column dtype is object and the column name contains 'docName'
        if df_court[n].dtype.name == 'object' and 'docName' in n:
            for value in df_court[n]:  # for each element of the column
                # if the fragment "Исполнительный лист" occurs in the value
                if 'Исполнительный лист' in value:
                    value = 'Положительный фактор'  # assign the label "Положительный фактор"
                    print(value)  # check by printing

But it raises an error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-70-cc61f6ae65fe> in <module>()
          2     if df_court[n].dtype.name == 'object' and 'docName' in n:
          3         for value in df_court[n]:
    ----> 4             if 'Исполнительный' in value:
          5                 value = 'збс'
          6                 print(value)

    TypeError: argument of type 'float' is not iterable

Please suggest the correct syntax, or the correct logic if that is wrong as well.
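A note on the error itself: a missing value in an object column is stored as a float NaN, and the `in` operator cannot search inside a float, hence the TypeError. Separately, assigning to the loop variable `value` never writes anything back into the DataFrame. A minimal sketch of both points, assuming a hypothetical three-value Series `s` and a hypothetical helper `relabel` (neither is from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of one docName column; NaN marks a missing value.
s = pd.Series([np.nan, 'Исполнительный лист', 'Жалоба'], name='docName1')

def relabel(value):
    # NaN is a float, so check for str before using `in`.
    if isinstance(value, str) and 'Исполнительный лист' in value:
        return 'Положительный фактор'
    return value

# Rebinding a loop variable would leave the Series untouched;
# map() builds a new Series, which has to be stored explicitly.
s = s.map(relabel)
print(s)
```

This only explains the failure; it is a per-value loop and will likely be slow across 700 columns.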

From this input:

    Series:
    nan
    Исполнительное производство
    Исполнительное производство
    Отзыв иска
    Жалоба
    Жалоба
    Жалоба
    Жалоба
    Ходатайство
    Иск удовлетворить

I would like to get output like this:

    Series
    Нейтральный фактор  # comes from nan
    Положительный фактор
    Положительный фактор
    Негативный фактор
    Негативный фактор
    Негативный фактор
    Негативный фактор
    Негативный фактор
    Положительный фактор
    Положительный фактор

In other words, the categories should be renamed by fragments of phrases, because it is not possible to track every exact variant of these specialized terms.

In the given example, are these the values of one column or the names of different columns?
These are the values of a single column, df_court[n], where n is a column name from df_court.columns.

Accepted Answer · 2018-10-27T09:39:18

Source DataFrame:

    In [35]: df
    Out[35]:
                          docName1  num                     docName2
    0                          nan   31                   Отзыв иска
    1  Исполнительное производство    0                  Ходатайство
    2  Исполнительное производство   29                       Жалоба
    3                   Отзыв иска   43  Исполнительное производство
    4                       Жалоба   34                          nan
    5                       Жалоба   28            Иск удовлетворить
    6                       Жалоба   93  Исполнительное производство
    7                       Жалоба   62                       Жалоба
    8                  Ходатайство   24                       Жалоба
    9            Иск удовлетворить   82                       Жалоба

Solution: use the DataFrame.replace() method:

    # Regex fragments for each category (values are lower-cased before matching)
    positive = [r'исполнительн.*\sпроизводств.*', r'ходатайство', r'иск\s*удовлетворить']
    negative = [r'отзыв\s*иска', r'жалоб.*']
    neutral = [r'nan']

    def categorize(df, positive=positive, negative=negative, neutral=neutral):
        # Group the alternatives so that ^ and $ apply to every fragment
        pos = '^(?:{})$'.format('|'.join(positive))
        neg = '^(?:{})$'.format('|'.join(negative))
        neut = '^(?:{})$'.format('|'.join(neutral))
        # Patterns and labels are paired by position
        return df.replace([pos, neg, neut],
                          ['Положительный фактор', 'Негативный фактор', 'Нейтральный фактор'],
                          regex=True)

    # Object-typed columns whose names match docName<number>
    mask = df.columns.str.contains(r'docName\d+') & df.dtypes.eq('object')
    df.loc[:, mask] = categorize(df.loc[:, mask].apply(lambda x: x.str.lower()),
                                 positive=positive, negative=negative, neutral=neutral)

Result:

    In [41]: df
    Out[41]:
                   docName1  num              docName2
    0    Нейтральный фактор   31     Негативный фактор
    1  Положительный фактор    0  Положительный фактор
    2  Положительный фактор   29     Негативный фактор
    3     Негативный фактор   43  Положительный фактор
    4     Негативный фактор   34    Нейтральный фактор
    5     Негативный фактор   28  Положительный фактор
    6     Негативный фактор   93  Положительный фактор
    7     Негативный фактор   62     Негативный фактор
    8  Положительный фактор   24     Негативный фактор
    9  Положительный фактор   82     Негативный фактор
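A side note on missing values: the pattern 'nan' above matches cells that contain the text nan. If the real data holds genuine missing values (float NaN, which the TypeError in the question suggests), a regex replace leaves them as they are. A minimal sketch of one alternative, assuming a hypothetical three-row frame and simplified fragment patterns that are not taken from the answer: build boolean masks with Series.str.contains(..., na=False) and pick the labels with numpy.select.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; unlike the frame in the answer, it holds real NaN values.
df = pd.DataFrame({
    'docName1': [np.nan, 'Исполнительное производство', 'Жалоба'],
    'num': [31, 0, 29],
    'docName2': ['Отзыв иска', 'Ходатайство', np.nan],
})

# Simplified fragment patterns (an assumption, adjust to the real vocabulary).
positive = r'исполнительн|ходатайств|иск\s*удовлетворить'
negative = r'отзыв\s*иска|жалоб'

def categorize_column(col):
    # na=False makes missing cells count as "no match" instead of NaN.
    is_pos = col.str.contains(positive, case=False, na=False)
    is_neg = col.str.contains(negative, case=False, na=False)
    # The first true condition wins; everything else gets the default label.
    labels = np.select([is_pos, is_neg],
                       ['Положительный фактор', 'Негативный фактор'],
                       default='Нейтральный фактор')
    return pd.Series(labels, index=col.index)

# Selected by name only; assumes every docName column holds text.
doc_cols = [c for c in df.columns if 'docName' in c]
for c in doc_cols:
    df[c] = categorize_column(df[c])
print(df)
```

Here every value that matches neither list, including NaN, falls into the neutral category, which may or may not be what the real data needs.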

For some reason it does not work... The columns remain unchanged.
@StepanSokol, I can't really help you without a reproducible data set...
Is there a way to solve this problem without regular expressions?
In theory, it comes down to correctly addressing the rows of a specific column.
@StepanSokol, of course: replace the regular expressions with string literals and remove the regex=True parameter.
But in that case the value of the entire cell is compared, so substring search and replacement will not work this way.
@StepanSokol, I think it would be easier if you created a small data set that closely resembles your real data, uploaded it as a CSV or Excel file to a file-sharing service, and showed the desired resulting data set.
Then you would be able to accept the solution without reworking or adapting it... P.S. In this case it is worth opening a new question.
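To illustrate the difference described in these comments, here is a small sketch, assuming a hypothetical Series, a hypothetical fragment table `fragments`, and a hypothetical helper `by_fragment` (none of them from the thread). A literal replace only changes cells that equal the key exactly, while a plain Python substring test can still match fragments without any regular expressions.

```python
import pandas as pd

# Hypothetical values; the second one is a longer variant of the first.
s = pd.Series(['Жалоба', 'Жалоба на постановление', 'Отзыв иска'])

# Literal replace compares the whole cell, so only the exact 'Жалоба' changes.
exact = s.replace({'Жалоба': 'Негативный фактор'})

# Hypothetical fragment-to-label table for substring matching without regex.
fragments = {
    'жалоб': 'Негативный фактор',
    'отзыв': 'Негативный фактор',
    'ходатайств': 'Положительный фактор',
}

def by_fragment(value, default='Нейтральный фактор'):
    # Non-string cells (for example NaN) get the default label.
    if not isinstance(value, str):
        return default
    text = value.lower()
    for fragment, label in fragments.items():
        if fragment in text:
            return label
    return default

partial = s.map(by_fragment)
print(exact.tolist())
print(partial.tolist())
```

The dictionary order decides which label wins when a value contains several fragments, so the more specific fragments should probably come first.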
