I work with pandas. And I have a data set with the target variable pglabgro . The list of variable values ​​is given below:

pglabgro (=pglabgro) - Current gross income in euros emp_full - full employment (binary variable) emp_not - unemployed (binary variable) i11103 - household income labor ijob1 - income from main job ijob2 - Income from secondary employment

The data is as follows:

 IncomeSet[IncomeSet["pglabgro"]<0].head() Out[32]: pid syear cid hid pglabgro emp_full emp_part emp_not \ 330416 2585401 2006 258547 258547 -2 0 0 1 330425 492301 2006 49239 49239 -2 0 0 1 330428 3069801 2006 306983 306983 -2 0 0 1 330429 2684101 2006 268410 268410 -2 0 0 1 330432 3038501 2006 303852 303852 -2 0 0 1 i11103 ijob1 ijob2 plc0013 330416 0 0 0 -2 330425 720 0 0 -2 330428 0 0 0 -2 330429 0 0 0 -2 330432 9396 9396 0 -2 

The value of the variable -2 means that the answer to the question: what is your current roll income? was not given. And the lack of data on this pglabgro variable is typical for those people who do not have work (emp_not = 1). But there are a lot of such values, and at the same time some people still have incomes from primary or secondary work or household income (i11103, ijob1, ijob2). Therefore, I would like to reduce the amount of missing data by assigning values ​​for the variable pglabgro from the variables i11103, ijob1 and ijob2.

for this, I wrote the following function:

 for x in list(IncomeSet["pglabgro"]): if IncomeSet[IncomeSet["pglabgro"].ix[x]<0]: if IncomeSet[IncomeSet["i11103"]>=0 & IncomeSet["i11103"]]<=0: if IncomeSet[IncomeSet["ijob1"]>=0 & IncomeSet["ijob1"]]<=0: if IncomeSet[IncomeSet["ijob2"]>=0 & IncomeSet["ijob2"]]<=0: IncomeSet["pglabgro"].ix[x]=-2 else: IncomeSet["pglabgro"].ix[x]==IncomeSet["ijob2"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["ijob1"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["i11103"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["pglabgro"] 

But I get the error: KeyError: 400 I don’t know yet how to improve the assignment logic. The result according to the data sample above should be as follows:

  pid syear cid hid pglabgro emp_full emp_part emp_not \ 330416 2585401 2006 258547 258547 -2 0 0 1 330425 492301 2006 49239 49239 720 0 0 1 330428 3069801 2006 306983 306983 -2 0 0 1 330429 2684101 2006 268410 268410 -2 0 0 1 330432 3038501 2006 303852 303852 9396 0 0 1 i11103 ijob1 ijob2 plc0013 330416 0 0 0 -2 330425 720 0 0 -2 330428 0 0 0 -2 330429 0 0 0 -2 330432 9396 9396 0 -2 

The logic was: If the value of the variable pglabgro from the list of values ​​of this variable is -2, then we look at the value of the household income variable - i11103. If It is non-zero, then we take its value, if not, then go to the variable ijob1 (income from the main job). If this variable is nonzero, then we assign the variable pglabgro its value, if it is zero, then go to the last variable ijob2 (Income from secondary employment), if it is zero, then the variable pglabgro remains with the value -2. If ijob2> 0, then assign its value to the variable pglabgro.

Thus, for pglabgro, we replace the values ​​of -2 according to the existing values ​​of the variables i11103, ijob1 and ijob2.

How to replace the values ​​of one variable with the values ​​of others in python pandas?

    1 answer 1

    The cause of the error in the incorrect use of the .ix[] method. Here is a reconstruction of your mistake for the example you cited:

     In [22]: IncomeSet Out[22]: pid syear cid hid pglabgro emp_full emp_part emp_not i11103 ijob1 ijob2 plc0013 330416 2585401 2006 258547 258547 -2 0 0 1 0 0 0 -2 330425 492301 2006 49239 49239 -2 0 0 1 720 0 0 -2 330428 3069801 2006 306983 306983 -2 0 0 1 0 0 0 -2 330429 2684101 2006 268410 268410 -2 0 0 1 0 0 0 -2 330432 3038501 2006 303852 303852 -2 0 0 1 9396 9396 0 -2 In [23]: list(IncomeSet["pglabgro"]) Out[23]: [-2, -2, -2, -2, -2] In [24]: x = -2 In [25]: IncomeSet["pglabgro"].ix[x] ... skipped ... KeyError: -2 

    In your case, the first line is pglabgro == 400 , so an error occurs: KeyError: 400

    I would replace all your for x in list(IncomeSet["pglabgro"]) loop on the next. three lines:

     In [60]: IncomeSet.ix[IncomeSet.eval('pglabgro == -2 and ijob1 > 0'), 'pglabgro'] = IncomeSet.ijob1 In [62]: IncomeSet.ix[IncomeSet.eval('pglabgro == -2 and ijob2 > 0'), 'pglabgro'] = IncomeSet.ijob2 In [63]: IncomeSet.ix[IncomeSet.eval('pglabgro == -2 and i11103 > 0'), 'pglabgro'] = IncomeSet.i11103 

    Result:

     In [64]: IncomeSet Out[64]: pid syear cid hid pglabgro emp_full emp_part emp_not i11103 ijob1 ijob2 plc0013 330416 2585401.0 2006.0 258547.0 258547.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330425 492301.0 2006.0 49239.0 49239.0 720.0 0.0 0.0 1.0 720.0 0.0 0.0 -2.0 330428 3069801.0 2006.0 306983.0 306983.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330429 2684101.0 2006.0 268410.0 268410.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330432 3038501.0 2006.0 303852.0 303852.0 9396.0 0.0 0.0 1.0 9396.0 9396.0 0.0 -2.0 

    Some explanations:

     In [66]: IncomeSet.eval('pglabgro == -2 and ijob1 > 0') Out[66]: 330416 False 330425 False 330428 False 330429 False 330432 True dtype: bool In [69]: IncomeSet.ix[IncomeSet.eval('pglabgro == -2 and ijob1 > 0'), 'pglabgro'] = IncomeSet.ijob1 

    after the first assignment:

     In [70]: IncomeSet Out[70]: pid syear cid hid pglabgro emp_full emp_part emp_not i11103 ijob1 ijob2 plc0013 330416 2585401.0 2006.0 258547.0 258547.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330425 492301.0 2006.0 49239.0 49239.0 -2.0 0.0 0.0 1.0 720.0 0.0 0.0 -2.0 330428 3069801.0 2006.0 306983.0 306983.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330429 2684101.0 2006.0 268410.0 268410.0 -2.0 0.0 0.0 1.0 0.0 0.0 0.0 -2.0 330432 3038501.0 2006.0 303852.0 303852.0 9396.0 0.0 0.0 1.0 9396.0 9396.0 0.0 -2.0 In [71]: IncomeSet.eval('pglabgro == -2 and i11103 > 0') Out[71]: 330416 False 330425 True 330428 False 330429 False 330432 False dtype: bool 
    • As for example, for index 330432, are there values ​​for two variables i11103 and ijob1 at once? I thought that a consistent search of conditions could somehow cope. I look at the data and for some of the last values ​​(the tail () parameter) there are many NAN values. Therefore, first of all, I think it would be right to take the data from the ijob1 column, if the data in it is zero, contact ijob2, and if this variable is also 0, then take the data from i11103, if in the last NAN or 0, then leave - 2 for pglabgro. - user21
    • This is really, very elegant way. As I understood from the documentation, Python evaluates 'pglabgro == -2 and ijob1 > 0' in a string format as logical for the variable pglabgro . Is it possible to replace it with the same function only using logical operators? (I do not quite understand how this function works) If I understand correctly, then the logic is this, take the line for which 'pglabgro == -2 and ijob1 > 0' is correct and assign the value = IncomeSet.ijob1 , right? Why do we need the 'pglabgro' parameter after the comma? Thank you - user21
    • one
      @ user21, did not quite understand the question ... x.eval('c1 == -2 and c2 > 0') equivalent (and will be translated to): (x.c1 == -2) & (x.c2 > 0) - MaxU