I have a data set with the columns `UserID`, `System` (the system used by the user), and `CONCAT A` (the concatenation of these two columns). Here is an example of the data:

    >>> RolCatBR_IDMqes1.loc[0:11]
         UserID            System         CONCAT A
    0   ANTANAS  P1B_010, P2Z_010  P1B_010|ANTANAS
    1   AWYGASC  P1B_010, P2Z_010  P1B_010|AWYGASC
    2   CHENQIA  P1B_010, P2Z_010  P1B_010|CHENQIA
    3   CHENQIA  P3Z_020, P3Z_030  P3Z_020|CHENQIA
    4   DBORZUT  P1B_010, P2Z_010  P1B_010|DBORZUT
    5   DURAKER  P1B_010, P2Z_010  P1B_010|DURAKER
    6   JEBINDE  P1B_010, P2Z_010  P1B_010|JEBINDE
    7   SMETTAN  P1B_010, P2Z_010  P1B_010|SMETTAN
    8   TKAUL13  P3Z_020, P3Z_030  P3Z_020|TKAUL13
    9   VATERCH  P3Z_020, P3Z_030  P3Z_020|VATERCH
    10  ABUNNEN           P2Z_010  P2Z_010|ABUNNEN
    11  AMILSKI           P2Z_010  P2Z_010|AMILSKI

For example, take the first row (index 0). I need to extract the second system, P2Z_010, into a new row that has the same UserID, with System set to P2Z_010 and CONCAT A updated to match.

The desired result:

          UserID   System         CONCAT A
    0    ANTANAS  P1B_010  P1B_010|ANTANAS
    0.5  ANTANAS  P2Z_010  P2Z_010|ANTANAS
    1    AWYGASC  P1B_010  P1B_010|AWYGASC
    1.5  AWYGASC  P2Z_010  P2Z_010|AWYGASC
    ...

I tried to apply the method suggested by @Wen:

    s2 = RolCatBR_IDMqes1['System'].str.split(',')
    w2 = pd.DataFrame({
        'UserID': RolCatBR_IDMqes1['UserID'].repeat(s2.str.len().fillna(value=0).astype(int)),
        'System': sum(s2.tolist(), []),
        'CONCATA': RolCatBR_IDMqes1['CONCATA'].repeat(s2.str.len().fillna(value=0).astype(int))
    })

But I get an error and I do not know how to fix it:

      File "<ipython-input-93-42d2e6fcce42>", line 1, in <module>
        sum(s2.tolist(),[])

    TypeError: can only concatenate list (not "float") to list

How can I extract the extra values from a multi-value cell and put each one into a duplicated row? Or how can I fix the error so that this method works?

Sample data for error reproduction:

    df1 = df.iloc[1130:1140]
    df1
    Out[79]:
           UserID   System         CONCAT A
    1130      NaN      NaN              NaN
    1131  AYNERDO  P1B_010  P1B_010|AYNERDO
    1132  CKIESCH  P1B_010  P1B_010|CKIESCH
    1133  JBRETTS  P1B_010  P1B_010|JBRETTS
    1134  YASSMAN  P1B_010  P1B_010|YASSMAN
    1135  EPFITZE  P1B_010  P1B_010|EPFITZE
    1136      NaN      NaN              NaN
    1137  HUBBARA  P1B_010  P1B_010|HUBBARA
    1138  TQUINTO  P1B_010  P1B_010|TQUINTO
    1139      NaN      NaN              NaN

    list(df1)
    Out[80]: ['UserID', 'System', 'CONCAT A']

Then I defined the explode() function from the answer and ran this code:

    res = explode(df.assign(System=df['System'].str.split(',\s*', expand=False)), ['System'])

      line 42, in _wrapit
        result = getattr(asarray(obj), method)(*args, **kwds)

    TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I think I understand why this happens: it is probably caused by the NaN values. Should I replace them with zeros, or is it better to remove those rows?

    1 answer

    You can use the explode() helper function (its code is given at the end of this answer):

        In [283]: res = explode(df.assign(System=df['System'].str.split(',\s*', expand=False)), ['System'])

        In [284]: res
        Out[284]:
                 UserID   System          CONCAT A
        0       ANTANAS  P1B_010   P1B_010|ANTANAS
        1       ANTANAS  P2Z_010   P1B_010|ANTANAS
        2       AWYGASC  P1B_010   P1B_010|AWYGASC
        3       AWYGASC  P2Z_010   P1B_010|AWYGASC
        4       CHENQIA  P1B_010   P1B_010|CHENQIA
        5       CHENQIA  P2Z_010   P1B_010|CHENQIA
        6       CHENQIA  P3Z_020   P3Z_020|CHENQIA
        ...         ...      ...               ...
        14868    RKLESS  P1B_010    P1B_010|RKLESS
        14869   SARACHR  P1B_010   P1B_010|SARACHR
        14870   TGUNZEN  P1B_010   P1B_010|TGUNZEN
        14871  TSCHULTK  P1B_010  P1B_010|TSCHULTK
        14872    WEHEIL  P1B_010    P1B_010|WEHEIL
        14873   RSIELAF  P1B_010   P1B_010|RSIELAF
        14874   SCHUESA  P3Z_020   P3Z_020|SCHUESA

        [14875 rows x 3 columns]

    Some explanations:

    First, you need to convert the comma-separated values to lists:

        In [287]: df['System'].str.split(',\s*', expand=False)
        Out[287]:
        0        [P1B_010, P2Z_010]
        1        [P1B_010, P2Z_010]
        2        [P1B_010, P2Z_010]
        3        [P3Z_020, P3Z_030]
        4        [P1B_010, P2Z_010]
                        ...
        11695             [P1B_010]
        11696             [P1B_010]
        11697             [P1B_010]
        11700             [P1B_010]
        11701             [P3Z_020]
        Name: System, Length: 11643, dtype: object

    The same operation, with the result replacing the System column in the DataFrame:

        In [288]: df.assign(System=df['System'].str.split(',\s*', expand=False))
        Out[288]:
                 UserID              System          CONCAT A
        0       ANTANAS  [P1B_010, P2Z_010]   P1B_010|ANTANAS
        1       AWYGASC  [P1B_010, P2Z_010]   P1B_010|AWYGASC
        2       CHENQIA  [P1B_010, P2Z_010]   P1B_010|CHENQIA
        3       CHENQIA  [P3Z_020, P3Z_030]   P3Z_020|CHENQIA
        4       DBORZUT  [P1B_010, P2Z_010]   P1B_010|DBORZUT
        ...         ...                 ...               ...
        11695   TGUNZEN           [P1B_010]   P1B_010|TGUNZEN
        11696  TSCHULTK           [P1B_010]  P1B_010|TSCHULTK
        11697    WEHEIL           [P1B_010]    P1B_010|WEHEIL
        11700   RSIELAF           [P1B_010]   P1B_010|RSIELAF
        11701   SCHUESA           [P3Z_020]   P3Z_020|SCHUESA

        [11643 rows x 3 columns]
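The choice of separator pattern in the split step matters. The short sketch below, run on a hypothetical two-value Series rather than the real data, compares a plain `','` separator (as in the question's first attempt) with the `,\s*` pattern used above. The variable names `plain` and `regex` are illustrative only.

```python
import pandas as pd

# Hypothetical sample modelled on the System column from the question.
s = pd.Series(["P1B_010, P2Z_010", "P2Z_010"], name="System")

# A single-character separator is treated literally,
# so the space after the comma stays in the second item.
plain = s.str.split(",")

# A longer pattern is treated as a regular expression;
# `\s*` swallows any whitespace that follows the comma.
regex = s.str.split(r",\s*")

print(plain.tolist())  # [['P1B_010', ' P2Z_010'], ['P2Z_010']]
print(regex.tolist())  # [['P1B_010', 'P2Z_010'], ['P2Z_010']]
```

With the plain separator, the second system would carry a leading space into every later comparison or concatenation, which is easy to miss in printed output.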

    Code of the explode() function:

        import numpy as np
        import pandas as pd

        def explode(df, lst_cols, fill_value=''):
            # make sure `lst_cols` is a list
            if lst_cols and not isinstance(lst_cols, list):
                lst_cols = [lst_cols]
            # all columns except `lst_cols`
            idx_cols = df.columns.difference(lst_cols)

            # calculate lengths of lists
            lens = df[lst_cols[0]].str.len()

            # repeat the scalar columns once per list element,
            # then put the flattened list columns next to them
            exploded = pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols
            }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

            if (lens > 0).all():
                # ALL lists in cells aren't empty
                return exploded.loc[:, df.columns]
            else:
                # at least one list in cells is empty: re-attach those rows
                # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
                return pd.concat([exploded, df.loc[lens == 0, idx_cols]]) \
                         .fillna(fill_value) \
                         .loc[:, df.columns]
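As an alternative to the helper above (not the method this answer uses), pandas 0.25 and later ship a built-in `DataFrame.explode`. The sketch below applies it to a hypothetical two-row sample with the question's column names, and also rebuilds `CONCAT A` from the exploded `System` value, which the question asks for. The sample rows and the `System|UserID` key format are assumptions taken from the data shown in the question.

```python
import pandas as pd

# Hypothetical sample using the column names from the question.
df = pd.DataFrame({
    "UserID": ["ANTANAS", "ABUNNEN"],
    "System": ["P1B_010, P2Z_010", "P2Z_010"],
    "CONCAT A": ["P1B_010|ANTANAS", "P2Z_010|ABUNNEN"],
})

res = (
    df.assign(System=df["System"].str.split(r",\s*"))  # strings -> lists
      .explode("System")                               # one row per list element
      .reset_index(drop=True)                          # replace the repeated index
)

# Rebuild the key so that it matches the system on each new row
# (assumes the "System|UserID" format seen in the question).
res["CONCAT A"] = res["System"] + "|" + res["UserID"]

print(res)
```

The built-in method keeps rows whose cell is NaN or an empty list (the value becomes NaN), so it does not raise on blank rows, although dropping those rows first is usually still what you want.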
    • I don't know what's wrong: I can't get the same result even when I run the code exactly as given. The following error appears: TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'. If I convert the list lengths inside the function using .fillna(value=0).astype(int), a different error occurs: ValueError: all the input arrays must have same number of dimensions
    • Please add sample data to the question that reproduces this error ... - MaxU
    • @user21, perhaps the problem is caused by empty rows in the Excel file. Try reading it like this: df = pd.read_excel(filename).dropna(how='all') - MaxU
    • Thank you very much. I removed the NaN rows, the errors disappeared, and the code ran correctly: df2 = df.dropna(subset = ['UserID', 'System']) - user21
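The fix from the comments can be sketched on a hypothetical three-row sample with one all-NaN row, similar to the rows at 1130, 1136 and 1139 in the question. The snippet first shows that the NaN row is what breaks the list flattening, then repeats the steps after `dropna`. The names `flatten_failed`, `clean`, `flat` and `repeated_users` are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with an all-NaN row, as produced by a blank spreadsheet line.
df = pd.DataFrame({
    "UserID": ["AYNERDO", np.nan, "HUBBARA"],
    "System": ["P1B_010", np.nan, "P1B_010, P2Z_010"],
    "CONCAT A": ["P1B_010|AYNERDO", np.nan, "P1B_010|HUBBARA"],
})

lists = df["System"].str.split(r",\s*")

# The NaN cell stays NaN (a float) after the split,
# so adding it to a list fails.
try:
    sum(lists.tolist(), [])
    flatten_failed = False
except TypeError:
    flatten_failed = True

# Drop the incomplete rows first, as in the accepted fix.
clean = df.dropna(subset=["UserID", "System"])
clean_lists = clean["System"].str.split(r",\s*")

# Without NaN, every row has a list length, so repeat and concatenate work.
lens = clean_lists.str.len().to_numpy()
flat = np.concatenate(clean_lists.to_numpy())
repeated_users = np.repeat(clean["UserID"].to_numpy(), lens)

print(flatten_failed)
print(list(zip(repeated_users, flat)))
```

Filling the missing lengths with zero, as tried in the first comment, only fixes the repeat counts; the NaN cell itself is still passed to the concatenation step, which is why removing the rows works better than replacing the values.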