I am working in pandas. The data looks like this:
merged4_new['pgfamstd']
Out[57]:
0         [1] verheiratet zus.
1         [1] verheiratet zus.
2         [1] verheiratet zus.
3         [1] verheiratet zus.
                  ...
470702               [3] ledig
470703    [1] verheiratet zus.
470704               [3] ledig
470705    [1] verheiratet zus.
470706               [3] ledig
Looking at the distribution, I want to delete some variable values:
merged4_new['pgfamstd'].value_counts()
Out[66]:
[1] verheiratet zus.         289419
[3] ledig                    108685
[4] geschieden                27042
[5] verwitwet                 26310
[2] verheiratet getr.          7887
[6] Ehepartner im Ausland       825
-1.0                             21
-3.0                             10
Name: pgfamstd, dtype: int64
Specifically, I want to drop the values [6] Ehepartner im Ausland, -1.0 and -3.0. So far the only way I have managed this is to use LabelEncoder from sklearn.preprocessing to assign numeric values to pgfamstd, and then filter on those numbers to get rid of the unwanted values with merged4_new = merged4[merged4['pgfamstd']>1]. The problem is that after filtering, the codes assigned earlier are no longer consecutive and only scattered categories remain. For example, instead of the codes 0, 1, 2, 3, 4, 5, only 1, 3, 4, 5 are left.
I tried to use a boolean mask (data selection with a logical AND, &) to remove these values at the very beginning, before encoding, but I got an error. How do I remove some of a variable's values so that the numeric codes assigned to the remaining categories stay consistent, without gaps?
merged4_new['pgfamstd'].dtype
Out[67]: dtype('O')
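
One possible way to do what is described above is to filter first and encode afterwards, so that the codes are assigned only to the categories that survive. The sketch below is a minimal illustration, not a tested solution for the real data: it uses a small made-up frame named df in place of merged4_new, assumes the unwanted entries are everything outside a hand-written list of labels (here called keep), and swaps LabelEncoder for pandas' own categorical codes.

```python
import pandas as pd

# Toy stand-in for merged4_new (assumption: the real column mixes
# string labels with float codes such as -1.0 and -3.0).
df = pd.DataFrame({"pgfamstd": [
    "[1] verheiratet zus.", "[3] ledig", "[4] geschieden", -1.0,
    "[6] Ehepartner im Ausland", "[5] verwitwet", -3.0,
    "[2] verheiratet getr.", "[1] verheiratet zus.",
]})

# Labels to keep, in the order the codes should follow.
keep = [
    "[1] verheiratet zus.",
    "[2] verheiratet getr.",
    "[3] ledig",
    "[4] geschieden",
    "[5] verwitwet",
]

# Boolean mask: True for rows whose value is one of the kept labels.
mask = df["pgfamstd"].isin(keep)
filtered = df.loc[mask].copy()

# Encode only after filtering, so the codes run 0..4 with no gaps.
filtered["pgfamstd"] = pd.Categorical(
    filtered["pgfamstd"], categories=keep, ordered=True
)
filtered["pgfamstd_code"] = filtered["pgfamstd"].cat.codes

print(filtered)
```

Listing the labels to keep, rather than the ones to drop, sidesteps the question of whether -1.0 and -3.0 are stored as floats or as strings. If several comparisons are combined with & instead of isin, each comparison needs its own parentheses, for example (s != a) & (s != b), because & binds more tightly than the comparison operators; leaving the parentheses out is a common source of errors with this kind of mask.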