Colleagues, good afternoon!

Please help me fill a column with values from a list. There are 2 data frames (old and new) and a list:

spisok = ['Ivanov', 'Petrov', 'Sidorov']

df_old:

     id  score  revie_date  in_charge
    111      4  08.10.2019  Petrov
    123      2  04.03.2019  Sidorov
    145      5  04.04.2019  Ivanov
    135      6  20.05.2019  Petrov
    222      5  25.06.2019  Sidorov

df_new:

     id  score  revie_date  in_charge
    367      6  18.07.2019
    123      2  04.03.2019
    257      5  04.06.2019
    945      6  01.05.2019
    222      5  25.06.2019

The task is to assign a person from spisok to each row in arbitrary order, but so that the rows are split more or less evenly between them (the data frame can have more than 1000 rows).

In addition, df_new has to be compared with df_old: if an id from df_new is already present in df_old, the assignee is taken from df_old instead.

The result should look something like this:

df_new:

     id  score  revie_date  in_charge
    367      6  18.07.2019  Ivanov
    123      2  04.03.2019  Sidorov
    257      5  04.06.2019  Petrov
    945      6  01.05.2019  Ivanov
    222      5  25.06.2019  Sidorov

I tried to do it with

    df_new['in_charge'] = np.random.choice(spisok, size=len(df_new))

but the resulting distribution is far from even, and I also don't know how to then compare with the previous df_old.

  • Do you want the names to be uniformly distributed before or after copying the values from df_old? I.e., if you do it before copying, the result is unlikely to be a uniform distribution ... - MaxU
  • It doesn't really matter whether it's before or after ... what matters is that in the end the counts for the people on the list don't differ much - Pavel

1 answer

The np.random.choice() function lets you specify the probability with which each element of the list is selected.

To use this here, compute the probability for each name so that it accounts for the values already present in the column.
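As a small illustration of the `p` argument on its own, here is a minimal sketch. The names, the weights and the seed are made up for the demo and are not part of the question.

```python
import numpy as np

demo_names = ['a', 'b', 'c']          # hypothetical names for the demo
np.random.seed(0)                     # fixed seed only to make the sketch repeatable

# p needs one entry per name and must sum to 1
draws = np.random.choice(demo_names, size=1000, p=[0.6, 0.3, 0.1])

# share of each name among the draws; it should land near the requested weights
values, counts = np.unique(draws, return_counts=True)
shares = {str(v): c / len(draws) for v, c in zip(values, counts)}
print(shares)
```

The observed shares only approximate the weights, since every draw is random.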

Example:

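    import numpy as np
    import pandas as pd

    names = ['a', 'b', 'c', 'd']
    np.random.seed(321)
    # old frame: 10 ids with a deliberately uneven distribution of names
    old = pd.DataFrame({
        'id': np.arange(10),
        'in_charge': np.random.choice(names, 10, p=[0.4, 0.25, 0.2, 0.15])
    })
    # new frame: 100 ids, the first 10 of which also occur in old
    new = pd.DataFrame({'id': np.arange(100)})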

First, copy the values from old for the matching ids:

    new['in_charge'] = new['id'].map(old.set_index('id')['in_charge'])

distribution of values:

    In [75]: new['in_charge'].fillna('NaN').value_counts()
    Out[75]:
    NaN    90
    d       4
    a       3
    b       2
    c       1
    Name: in_charge, dtype: int64

calculate new probabilities:

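    # target share for each name: equal parts
    tgt_probs = pd.Series([1 / len(names)] * len(names), index=names)
    # share of all rows already taken by each name after copying from old
    cur_probs = new['in_charge'].value_counts() / len(new)
    # remaining share per name (a name absent from cur_probs counts as 0), never negative
    new_probs = tgt_probs.sub(cur_probs, fill_value=0).clip(lower=0)
    # rescale so the probabilities sum to 1 and follow the order of names
    new_probs = (new_probs / new_probs.sum()).reindex(names)

This gives: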

    In [76]: cur_probs
    Out[76]:
    a    0.04
    c    0.02
    d    0.02
    b    0.02
    Name: in_charge, dtype: float64

    In [77]: new_probs
    Out[77]:
    a    0.233333
    b    0.255556
    c    0.255556
    d    0.255556
    dtype: float64
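The numbers above can be checked by hand: each name should end up with 1/4 of all rows, and 90% of the rows are still empty. So, for example, the probability for a is (0.25 - 0.04) / 0.9. A minimal sketch of that arithmetic, taking the cur_probs values shown above as given:

```python
import pandas as pd

names = ['a', 'b', 'c', 'd']
# shares already occupied after copying, as printed above
cur_probs = pd.Series({'a': 0.04, 'b': 0.02, 'c': 0.02, 'd': 0.02})

target = 1 / len(names)                # 0.25 for each name
missing_share = 1 - cur_probs.sum()    # 0.90 of the rows are still empty

# share each name still needs, rescaled to the empty rows
new_probs = (target - cur_probs) / missing_share
print(new_probs.round(6))   # a: 0.233333, b/c/d: 0.255556
print(new_probs.sum())      # 1.0 up to floating-point rounding
```

Names that already hold more rows get a proportionally smaller probability for the remaining rows.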

Fill in the empty values according to the calculated probabilities:

    new.loc[new['in_charge'].isna(), 'in_charge'] = np.random.choice(
        names, new['in_charge'].isna().sum(), p=new_probs)

Result:

    In [80]: new['in_charge'].value_counts()
    Out[80]:
    a    28
    d    28
    b    26
    c    18
    Name: in_charge, dtype: int64

P.S. A perfectly even distribution cannot be guaranteed this way, because np.random.choice draws every value independently at random.
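If the counts need to be as even as possible, one alternative (not the approach used in this answer) is to hand out the empty rows one at a time to whichever name currently has the fewest rows. Below is a minimal sketch on the frames from the question. The helper name assign_balanced and its seed parameter are hypothetical, and the score and revie_date columns are left out for brevity.

```python
import numpy as np
import pandas as pd

spisok = ['Ivanov', 'Petrov', 'Sidorov']
df_old = pd.DataFrame({
    'id': [111, 123, 145, 135, 222],
    'in_charge': ['Petrov', 'Sidorov', 'Ivanov', 'Petrov', 'Sidorov'],
})
df_new = pd.DataFrame({'id': [367, 123, 257, 945, 222]})


def assign_balanced(new, old, names, seed=None):
    """Hypothetical helper: copy assignees by id, then deal out the rest evenly."""
    rng = np.random.default_rng(seed)
    out = new.copy()
    # step 1: take the assignee from the old frame where the id matches
    out['in_charge'] = out['id'].map(old.set_index('id')['in_charge']).astype(object)
    # rows per name so far; names not seen yet start at 0
    counts = out['in_charge'].value_counts().reindex(names, fill_value=0)
    empty = out.index[out['in_charge'].isna()]
    picks = []
    for _ in range(len(empty)):
        # step 2: the next row goes to a random name among those with the fewest rows
        least = counts[counts == counts.min()].index.to_numpy()
        name = rng.choice(least)
        picks.append(name)
        counts[name] += 1
    if len(empty):
        # shuffle so the names do not follow a round-robin pattern down the rows
        out.loc[empty, 'in_charge'] = rng.permutation(picks)
    return out


result = assign_balanced(df_new, df_old, spisok, seed=0)
print(result)
print(result['in_charge'].value_counts())
```

Rows 123 and 222 keep Sidorov from df_old, and the three remaining rows are split between Ivanov and Petrov. The final counts are as even as the copied values allow, but which new row gets which name still depends on the seed.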

  • Many thanks for the detailed and helpful answer! - Pavel