Hello, colleagues! Please tell me: why, when I run merge, is some of my data replaced with NaN, depending on the how parameter?



    NaN appears when using "outer" joins ( left , right , outer ) whenever there is no matching data in the other table for the join column ( Person Responsible in your case). That is, NaN can be thought of as NULL in a relational DBMS.

    Example:

    Data:

        In [8]: a = pd.DataFrame({'ID':[1,2,3], 'A':[11,12,13]})

        In [9]: b = pd.DataFrame({'ID':[3,4], 'B':[200,300]})

        In [10]: a
        Out[10]:
            A  ID
        0  11   1
        1  12   2
        2  13   3

        In [11]: b
        Out[11]:
             B  ID
        0  200   3
        1  300   4

    how='inner' (SQL: inner join):

        In [12]: a.merge(b, on=['ID'])
        Out[12]:
            A  ID    B
        0  13   3  200

    how='left' (SQL: left outer join) - the result contains all records from the "left" table a and only those records from the right table b that have a matching ID :

        In [13]: a.merge(b, on=['ID'], how='left')
        Out[13]:
            A  ID      B
        0  11   1    NaN
        1  12   2    NaN
        2  13   3  200.0

    how='right' (SQL: right outer join) - the result contains all records from the "right" table b and only those records from the left table a that have a matching ID :

        In [15]: a.merge(b, on=['ID'], how='right')
        Out[15]:
              A  ID    B
        0  13.0   3  200
        1   NaN   4  300

    how='outer' (SQL: full outer join) - the result contains all records from both tables:

        In [16]: a.merge(b, on=['ID'], how='outer')
        Out[16]:
              A  ID      B
        0  11.0   1    NaN
        1  12.0   2    NaN
        2  13.0   3  200.0
        3   NaN   4  300.0

    But the most "interesting" part begins when duplicates appear in your join column(s) - the result set will contain every combination of the matching rows from the two tables:

        In [17]: a.loc[len(a)] = [14, 3]

        In [18]: b.loc[len(b)] = [400, 3]

        In [19]: a
        Out[19]:
            A  ID
        0  11   1
        1  12   2
        2  13   3   # duplicate for ID: 3
        3  14   3   # duplicate for ID: 3

        In [20]: b
        Out[20]:
             B  ID
        0  200   3   # duplicate for ID: 3
        1  300   4
        2  400   3   # duplicate for ID: 3

    Inner join:

        In [21]: a.merge(b, on=['ID'])
        Out[21]:
            A  ID    B
        0  13   3  200   # A: 13, B: 200
        1  13   3  400   # A: 13, B: 400
        2  14   3  200   # A: 14, B: 200
        3  14   3  400   # A: 14, B: 400

    Outer joins:

        In [22]: a.merge(b, on=['ID'], how='left')
        Out[22]:
            A  ID      B
        0  11   1    NaN
        1  12   2    NaN
        2  13   3  200.0
        3  13   3  400.0
        4  14   3  200.0
        5  14   3  400.0

        In [23]: a.merge(b, on=['ID'], how='right')
        Out[23]:
              A  ID    B
        0  13.0   3  200
        1  14.0   3  200
        2  13.0   3  400
        3  14.0   3  400
        4   NaN   4  300

        In [24]: a.merge(b, on=['ID'], how='outer')
        Out[24]:
              A  ID      B
        0  11.0   1    NaN
        1  12.0   2    NaN
        2  13.0   3  200.0
        3  13.0   3  400.0
        4  14.0   3  200.0
        5  14.0   3  400.0
        6   NaN   4  300.0

    UPDATE:

    How do you perform type conversions, for example str -> int ?

    I know two ways of doing a "vectorized" conversion from str to int :

     df_3['Person Responsible'] = pd.to_numeric(df_3['Person Responsible'], errors='coerce') 

    and

     df_3['Person Responsible'] = df_3['Person Responsible'].astype(int, errors='ignore') 

    In the first case, values that cannot be converted to numbers (for example, the string 'AAA' ) are replaced with NaN and the whole column becomes float64 . In the second case, nothing is replaced: the whole column is converted to int only if every value can be converted successfully, otherwise it stays object .
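    For illustration only, here is a small sketch (not part of the original answer; the sample Series and its values are made up) showing how the two approaches behave on a column that mixes numeric strings with a non-numeric value:

        import pandas as pd

        s = pd.Series(['1', '2', 'AAA'])

        # pd.to_numeric with errors='coerce': 'AAA' becomes NaN,
        # and the result dtype is float64
        pd.to_numeric(s, errors='coerce')
        # 0    1.0
        # 1    2.0
        # 2    NaN
        # dtype: float64

        # astype with errors='ignore': the conversion fails on 'AAA',
        # so pandas returns the column unchanged and it stays object
        s.astype(int, errors='ignore')
        # 0      1
        # 1      2
        # 2    AAA
        # dtype: object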

    Converting int -> str is easier:

     ft['Person Responsible'] = ft['Person Responsible'].astype(str) 
    • MaxU, thanks, but I think that's not the point. There is data that matches on the key. My idea was that all rows that did not match on the Person Responsible key would be filtered out by NaN and I would get a clean frame with fully populated series. Tell me, could the problem be that, before this, I had also created the ft frame from two files using the merge method? - Denis Novik
    • Honestly, I don't know what to add to the above, but I still haven't solved this problem. I will put it aside until insight strikes. Thanks, MaxU. - Denis Novik
    • @DenisNovik, I would advise you to post small questions (3-5 lines each) with reproducible data (in text form, so it can be copied and used in code) - for example, a few matching and non-matching Person Responsible values - and show what you want to achieve. In general, merge() in Pandas is quite simple (not counting the cases with duplicates) - it works the same way as in a relational DBMS - MaxU
    • MaxU, but tell me, please, could the reason be the dtype? I see that in one frame Person Responsible has type int64, while in the other it is a non-null object (see the sketch after these comments). - Denis Novik
    • Once again, thank you very much for your help! The problem was solved by converting Person Responsible from int -> str. Everything merges perfectly. Now I am moving on to the next stacks :) - Denis Novik
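    For what it's worth, here is a minimal sketch of the dtype issue raised in the comments above (the frames, column contents and values are hypothetical, since the original data is not shown): when the merge key is int64 in one frame and object (strings) in the other, the keys 1 and '1' cannot match, so the merge cannot align the rows (newer pandas versions refuse to merge int64 with object keys outright). Converting one side so that the dtypes agree, as Denis did with int -> str, fixes it:

        import pandas as pd

        # Hypothetical data; column names mirror the question, values are invented
        df_3 = pd.DataFrame({'Person Responsible': [1, 2, 3],   # int64 key
                             'task': ['a', 'b', 'c']})
        ft = pd.DataFrame({'Person Responsible': ['1', '2'],    # object (str) key
                           'name': ['Ann', 'Bob']})

        # Align the key dtypes (int -> str), then merge as usual
        df_3['Person Responsible'] = df_3['Person Responsible'].astype(str)
        df_3.merge(ft, on='Person Responsible', how='left')
        #   Person Responsible task name
        # 0                  1    a  Ann
        # 1                  2    b  Bob
        # 2                  3    c  NaN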