I work with pandas. And I have a data set with the target variable pglabgro
. The list of variable values is given below:
pglabgro (=pglabgro)
- Current gross income in euros emp_full
- full employment (binary variable) emp_not
- unemployed (binary variable) i11103
- household income labor ijob1
- income from main job ijob2
- Income from secondary employment
The data is as follows:
IncomeSet[IncomeSet["pglabgro"]<0].head() Out[32]: pid syear cid hid pglabgro emp_full emp_part emp_not \ 330416 2585401 2006 258547 258547 -2 0 0 1 330425 492301 2006 49239 49239 -2 0 0 1 330428 3069801 2006 306983 306983 -2 0 0 1 330429 2684101 2006 268410 268410 -2 0 0 1 330432 3038501 2006 303852 303852 -2 0 0 1 i11103 ijob1 ijob2 plc0013 330416 0 0 0 -2 330425 720 0 0 -2 330428 0 0 0 -2 330429 0 0 0 -2 330432 9396 9396 0 -2
The value of the variable -2 means that the answer to the question: what is your current roll income? was not given. And the lack of data on this pglabgro variable is typical for those people who do not have work (emp_not = 1). But there are a lot of such values, and at the same time some people still have incomes from primary or secondary work or household income (i11103, ijob1, ijob2). Therefore, I would like to reduce the amount of missing data by assigning values for the variable pglabgro from the variables i11103, ijob1 and ijob2.
for this, I wrote the following function:
for x in list(IncomeSet["pglabgro"]): if IncomeSet[IncomeSet["pglabgro"].ix[x]<0]: if IncomeSet[IncomeSet["i11103"]>=0 & IncomeSet["i11103"]]<=0: if IncomeSet[IncomeSet["ijob1"]>=0 & IncomeSet["ijob1"]]<=0: if IncomeSet[IncomeSet["ijob2"]>=0 & IncomeSet["ijob2"]]<=0: IncomeSet["pglabgro"].ix[x]=-2 else: IncomeSet["pglabgro"].ix[x]==IncomeSet["ijob2"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["ijob1"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["i11103"] else: IncomeSet["pglabgro"].ix[x]==IncomeSet["pglabgro"]
But I get the error: KeyError: 400
I don’t know yet how to improve the assignment logic. The result according to the data sample above should be as follows:
pid syear cid hid pglabgro emp_full emp_part emp_not \ 330416 2585401 2006 258547 258547 -2 0 0 1 330425 492301 2006 49239 49239 720 0 0 1 330428 3069801 2006 306983 306983 -2 0 0 1 330429 2684101 2006 268410 268410 -2 0 0 1 330432 3038501 2006 303852 303852 9396 0 0 1 i11103 ijob1 ijob2 plc0013 330416 0 0 0 -2 330425 720 0 0 -2 330428 0 0 0 -2 330429 0 0 0 -2 330432 9396 9396 0 -2
The logic was: If the value of the variable pglabgro from the list of values of this variable is -2, then we look at the value of the household income variable - i11103. If It is non-zero, then we take its value, if not, then go to the variable ijob1 (income from the main job). If this variable is nonzero, then we assign the variable pglabgro its value, if it is zero, then go to the last variable ijob2 (Income from secondary employment), if it is zero, then the variable pglabgro remains with the value -2. If ijob2> 0, then assign its value to the variable pglabgro.
Thus, for pglabgro, we replace the values of -2 according to the existing values of the variables i11103, ijob1 and ijob2.
How to replace the values of one variable with the values of others in python pandas?