I'm working in pandas. The data looks like this:

    merged4_new['pgfamstd']
    Out[57]:
    0         [1] verheiratet zus.
    1         [1] verheiratet zus.
    2         [1] verheiratet zus.
    3         [1] verheiratet zus.
                      ...
    470702               [3] ledig
    470703    [1] verheiratet zus.
    470704               [3] ledig
    470705    [1] verheiratet zus.
    470706               [3] ledig

Looking at the distribution, I want to delete some of the variable's values:

    merged4_new['pgfamstd'].value_counts()
    Out[66]:
    [1] verheiratet zus.         289419
    [3] ledig                    108685
    [4] geschieden                27042
    [5] verwitwet                 26310
    [2] verheiratet getr.          7887
    [6] Ehepartner im Ausland       825
    -1.0                             21
    -3.0                             10
    Name: pgfamstd, dtype: int64

Namely, the values [6] Ehepartner im Ausland, -1.0 and -3.0. So far I have only managed to do this by using LabelEncoder from sklearn.preprocessing to assign numeric codes to pgfamstd, and then, once the variable is numeric, dropping the negative values with merged4_new = merged4[merged4['pgfamstd']>1]. But then the codes assigned earlier lose their sequence and only scattered categories remain: for example, instead of 0 1 2 3 4 5, only 1 3 4 5 are left.
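Roughly, the approach I described looks like this (just a sketch, not my exact code; the concrete codes depend on how LabelEncoder sorts the labels):

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    # replace the string categories with numeric codes 0..n-1
    # (LabelEncoder assigns them in sorted label order)
    merged4['pgfamstd'] = le.fit_transform(merged4['pgfamstd'].astype(str))

    # drop the unwanted codes, e.g. everything up to some threshold
    merged4_new = merged4[merged4['pgfamstd'] > 1]
    # the surviving codes are now scattered rather than consecutive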

I wanted to use a mask or data selection with a logical AND (&) right at the start, in order to remove some of the values before encoding, but I got an error. How do I remove part of a variable's values so that the numeric codes assigned to the remaining categories stay consecutive?

    merged4_new['pgfamstd'].dtype
    Out[67]: dtype('O')
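Since the column has object dtype (see above), the unwanted negative codes are presumably stored as the strings '-1.0' and '-3.0'. A sketch of the kind of mask I had in mind:

    # parentheses around each condition are required when combining with &,
    # because & binds tighter than !=
    mask = (merged4['pgfamstd'] != '[6] Ehepartner im Ausland') & \
           (merged4['pgfamstd'] != '-1.0') & \
           (merged4['pgfamstd'] != '-3.0')
    merged4_new = merged4[mask]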

    1 answer

    All this is quite easily done with pandas:

        In [46]: df
        Out[46]:
                             pgfamstd
        0        [1] verheiratet zus.
        1                        -3.0
        2   [6] Ehepartner im Ausland
        3       [2] verheiratet getr.
        4                   [3] ledig
        5                        -1.0
        6              [4] geschieden
        7        [1] verheiratet zus.
        8                        -3.0
        9                        -1.0
        10       [1] verheiratet zus.
        11  [6] Ehepartner im Ausland
        12             [4] geschieden
        13                  [3] ledig
        14                       -1.0
        15                  [3] ledig
        16              [5] verwitwet
        17  [6] Ehepartner im Ausland
        18                       -3.0
        19             [4] geschieden
        20      [2] verheiratet getr.
        21              [5] verwitwet
        22      [2] verheiratet getr.
        23              [5] verwitwet

        In [47]: vals_2_drop = ['[6] Ehepartner im Ausland', '-1.0', '-3.0']

        In [50]: df = df[~df.pgfamstd.isin(vals_2_drop)]

        In [51]: df
        Out[51]:
                             pgfamstd
        0        [1] verheiratet zus.
        3       [2] verheiratet getr.
        4                   [3] ledig
        6              [4] geschieden
        7        [1] verheiratet zus.
        10       [1] verheiratet zus.
        12             [4] geschieden
        13                  [3] ledig
        15                  [3] ledig
        16              [5] verwitwet
        19             [4] geschieden
        20      [2] verheiratet getr.
        21              [5] verwitwet
        22      [2] verheiratet getr.
        23              [5] verwitwet

    Step by Step:

        In [49]: df.pgfamstd.isin(vals_2_drop)
        Out[49]:
        0     False
        1      True
        2      True
        3     False
        4     False
        5      True
        6     False
        7     False
        8      True
        9      True
        10    False
        11     True
        12    False
        13    False
        14     True
        15    False
        16    False
        17     True
        18     True
        19    False
        20    False
        21    False
        22    False
        23    False
        Name: pgfamstd, dtype: bool

        In [48]: df[~df.pgfamstd.isin(vals_2_drop)]
        Out[48]:
                             pgfamstd
        0        [1] verheiratet zus.
        3       [2] verheiratet getr.
        4                   [3] ledig
        6              [4] geschieden
        7        [1] verheiratet zus.
        10       [1] verheiratet zus.
        12             [4] geschieden
        13                  [3] ledig
        15                  [3] ledig
        16              [5] verwitwet
        19             [4] geschieden
        20      [2] verheiratet getr.
        21              [5] verwitwet
        22      [2] verheiratet getr.
        23              [5] verwitwet
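    If you also need the remaining categories to carry consecutive numeric codes again (the 0 1 2 3 4 sequence from the question), you can re-encode them after the filtering step with plain pandas instead of LabelEncoder. A minimal sketch, assuming the filtered df from above (pd.factorize is one option; pd.Categorical(...).codes works similarly):

        import pandas as pd

        # re-encode the surviving categories with consecutive codes 0..n-1;
        # sort=True makes the codes follow the sorted label order [1]..[5]
        codes, labels = pd.factorize(df.pgfamstd, sort=True)
        df['pgfamstd_code'] = codes

    Because the unwanted rows are already gone, the codes stay consecutive no matter which categories were dropped.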
    • MaxU, and if you compute the distribution with .value_counts() after deleting those values, do the counts change? After checking the code, the output of this command stays the same, but if I write a condition like if X.any() == True: ... it turns out the values '[6] Ehepartner im Ausland', '-1.0', '-3.0' are no longer there. Does the ~ sign mean negation? - user21
    • @user21, I simply deleted all the rows containing ['[6] Ehepartner im Ausland', '-1.0', '-3.0']. value_counts() for the other values should not change. ~ is the logical negation operator, i.e. df[~df.pgfamstd.isin(vals_2_drop)] returns only those rows where pgfamstd is NOT one of the values in vals_2_drop (in other words, ALL EXCEPT the values from vals_2_drop). - MaxU
    • Thanks, but I did not mean the other values, I meant the ones that should have been deleted: merged4['pgfamstd'].value_counts() Out[5]: [1] verheiratet zus. 293703, [3] ledig 113596, [4] geschieden 27577, [5] verwitwet 26871, [2] verheiratet getr. 8058, -2.0 40, -1.0 23, -3.0 10 - user21
    • @user21, I don't quite understand - can you explain? Do you want to change the values rather than delete the rows? If so, tell me what you want to change the values ['[6] Ehepartner im Ausland', '-1.0', '-3.0'] to. - MaxU
    • I'll load the data from scratch and try again. If I get different results again, I'll write back. But in fact these values are no longer in the data set; the question is why merged4['pgfamstd'].value_counts() still shows them for me. Thanks! - user21