Hello, colleagues! Please tell me: why, when I run merge, is some of my data replaced with NaN, depending on the how parameter?



    NaN appears when using "outer" joins ( left , right , outer ) whenever there is no matching data in the other table for the join column ( Person Responsible in your case). That is, NaN can be thought of as NULL in a relational DBMS.

    Example:

    Data:

        In [8]: a = pd.DataFrame({'ID':[1,2,3], 'A':[11,12,13]})

        In [9]: b = pd.DataFrame({'ID':[3,4], 'B':[200,300]})

        In [10]: a
        Out[10]:
            A  ID
        0  11   1
        1  12   2
        2  13   3

        In [11]: b
        Out[11]:
             B  ID
        0  200   3
        1  300   4

    how='inner' (SQL: inner join):

        In [12]: a.merge(b, on=['ID'])
        Out[12]:
            A  ID    B
        0  13   3  200

    how='left' (SQL: left outer join) - the result contains all records from the "left" table a and only those records from the right table b that have a matching ID :

        In [13]: a.merge(b, on=['ID'], how='left')
        Out[13]:
            A  ID      B
        0  11   1    NaN
        1  12   2    NaN
        2  13   3  200.0

    how='right' (SQL: right outer join) - the result contains all records from the "right" table b and only those records from the left table a that have a matching ID :

        In [15]: a.merge(b, on=['ID'], how='right')
        Out[15]:
              A  ID    B
        0  13.0   3  200
        1   NaN   4  300

    how='outer' (SQL: full outer join) - the result contains all records from both tables:

        In [16]: a.merge(b, on=['ID'], how='outer')
        Out[16]:
              A  ID      B
        0  11.0   1    NaN
        1  12.0   2    NaN
        2  13.0   3  200.0
        3   NaN   4  300.0

    But the most "interesting" part begins when duplicates appear in your join column(s) - the result set will contain every combination of the matching rows from the two tables:

        In [17]: a.loc[len(a)] = [14, 3]

        In [18]: b.loc[len(b)] = [400, 3]

        In [19]: a
        Out[19]:
            A  ID
        0  11   1
        1  12   2
        2  13   3   # duplicate for ID: 3
        3  14   3   # duplicate for ID: 3

        In [20]: b
        Out[20]:
             B  ID
        0  200   3   # duplicate for ID: 3
        1  300   4
        2  400   3   # duplicate for ID: 3

    Inner join:

        In [21]: a.merge(b, on=['ID'])
        Out[21]:
            A  ID    B
        0  13   3  200   # A: 13, B: 200
        1  13   3  400   # A: 13, B: 400
        2  14   3  200   # A: 14, B: 200
        3  14   3  400   # A: 14, B: 400

    Outer joins:

        In [22]: a.merge(b, on=['ID'], how='left')
        Out[22]:
            A  ID      B
        0  11   1    NaN
        1  12   2    NaN
        2  13   3  200.0
        3  13   3  400.0
        4  14   3  200.0
        5  14   3  400.0

        In [23]: a.merge(b, on=['ID'], how='right')
        Out[23]:
              A  ID    B
        0  13.0   3  200
        1  14.0   3  200
        2  13.0   3  400
        3  14.0   3  400
        4   NaN   4  300

        In [24]: a.merge(b, on=['ID'], how='outer')
        Out[24]:
              A  ID      B
        0  11.0   1    NaN
        1  12.0   2    NaN
        2  13.0   3  200.0
        3  13.0   3  400.0
        4  14.0   3  200.0
        5  14.0   3  400.0
        6   NaN   4  300.0

    UPDATE:

    How do you perform type conversions, for example str -> int ?

    I know two ways of doing a "vectorized" conversion from str to int :

     df_3['Person Responsible'] = pd.to_numeric(df_3['Person Responsible'], errors='coerce') 

    and

     df_3['Person Responsible'] = df_3['Person Responsible'].astype(int, errors='ignore') 

    In the first case, values that cannot be converted to numbers (for example, the string 'AAA' ) are replaced with NaN and the whole column becomes float64 . In the second case, nothing is replaced: the whole column is converted to int only if every value can be converted successfully, otherwise it stays object .
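    For illustration only, here is a small sketch (not part of the original answer; the sample Series and its values are made up) showing how the two approaches behave on a column that mixes numeric strings with a non-numeric value:

        import pandas as pd

        s = pd.Series(['1', '2', 'AAA'])

        # pd.to_numeric with errors='coerce': 'AAA' becomes NaN,
        # and the result dtype is float64
        pd.to_numeric(s, errors='coerce')
        # 0    1.0
        # 1    2.0
        # 2    NaN
        # dtype: float64

        # astype with errors='ignore': the conversion fails on 'AAA',
        # so pandas returns the column unchanged and it stays object
        s.astype(int, errors='ignore')
        # 0      1
        # 1      2
        # 2    AAA
        # dtype: object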

    Converting int -> str is easier:

     ft['Person Responsible'] = ft['Person Responsible'].astype(str) 
    • MaxU, thanks, but I think that's not the point. There is data that matches on the key. My idea was that all rows that did not match on the Person Responsible key would be filtered out by NaN and I would get a clean frame with fully populated series. Tell me, could the problem be that, before this, I had also created the ft frame from two files using the merge method? - Denis Novik
    • Honestly, I don't know what to add to the above, but I still haven't solved this problem. I will put it aside until insight strikes. Thanks, MaxU. - Denis Novik
    • @DenisNovik, I would advise you to post small questions (3-5 lines each) with reproducible data (in text form, so it can be copied and used in code) - for example, a few matching and non-matching Person Responsible values - and show what you want to achieve. In general, merge() in Pandas is quite simple (not counting the cases with duplicates) - it works the same way as in a relational DBMS - MaxU
    • MaxU, but tell me, please, could the reason be the dtype? I see that in one frame Person Responsible has type int64, while in the other it is a non-null object (see the sketch after these comments). - Denis Novik
    • Once again, thank you very much for your help! The problem was solved by converting Person Responsible from int -> str. Everything merges perfectly. Now I am moving on to the next stacks :) - Denis Novik
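    For what it's worth, here is a minimal sketch of the dtype issue raised in the comments above (the frames, column contents and values are hypothetical, since the original data is not shown): when the merge key is int64 in one frame and object (strings) in the other, the keys 1 and '1' cannot match, so the merge cannot align the rows (newer pandas versions refuse to merge int64 with object keys outright). Converting one side so that the dtypes agree, as Denis did with int -> str, fixes it:

        import pandas as pd

        # Hypothetical data; column names mirror the question, values are invented
        df_3 = pd.DataFrame({'Person Responsible': [1, 2, 3],   # int64 key
                             'task': ['a', 'b', 'c']})
        ft = pd.DataFrame({'Person Responsible': ['1', '2'],    # object (str) key
                           'name': ['Ann', 'Bob']})

        # Align the key dtypes (int -> str), then merge as usual
        df_3['Person Responsible'] = df_3['Person Responsible'].astype(str)
        df_3.merge(ft, on='Person Responsible', how='left')
        #   Person Responsible task name
        # 0                  1    a  Ann
        # 1                  2    b  Bob
        # 2                  3    c  NaN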