    import io
    import pandas as pd

    data = pd.read_table(io.StringIO('''
    geo_id  TimeStamp
    1       1380052543
    3       1375703582
    34      1374479873
    3       1374479892
    '''), sep='\s+')

    geo = pd.read_table(io.StringIO('''
    id  TZ_OFFSET
    3   10800
    34  0
    1   7200
    '''), sep='\s+')

The essence of the task: I want to change the time column by adding the timestamp offset taken from the lookup table, finding the value in that table that matches the client id from the first table. In short, I need to create a new column based on columns from two different DataFrames, and those columns are linked: geo_id in the data frame corresponds to id in the geo frame.

I want to get:

    geo_id   TimeStamp  TZ_OFFSET
    1       1380052543       7200
    3       1375703582      10800
    34      1374479873          0
    3       1374479892      10800
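Roughly, what I need is a lookup of TZ_OFFSET by geo_id. A sketch of the idea (an illustration only, not code I have verified):

    # sketch: look up TZ_OFFSET by geo_id via a Series indexed by id
    data['TZ_OFFSET'] = data['geo_id'].map(geo.set_index('id')['TZ_OFFSET'])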
  • Welcome to Stack Overflow in Russian! It is better to attach textual information as text: a) it is easier to read; b) it can be copied; c) it is searchable. You can fix this by clicking "edit" under the question. - aleksandr barakin
  • Thank you, I edited the question. - Marija

1 answer

Vectorized solution:

mp is an auxiliary pandas.Series that will be used for the mapping:

    mp = geo.assign(offset=pd.to_timedelta(geo.TZ_OFFSET.astype(str) + 'S')).set_index('id')['offset']

    data['date'] = pd.to_datetime(data['TimeStamp'], unit='s') + data['geo_id'].map(mp)

    data['TS'] = ((pd.to_datetime(data['TimeStamp'], unit='s') + data['geo_id'].map(mp))
                  .astype('int64') // 10**9)

P.S. The date column was created only for clarity; judging by the desired output DF in the question, it will not be needed in the real script.
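If the date column does get computed for clarity, it can simply be dropped at the end; a minimal sketch:

    # drop the illustrative datetime column once TS has been computed
    data = data.drop(columns=['date'])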

Result:

    In [29]: data
    Out[29]:
       geo_id   TimeStamp                date          TS
    0       1  1380052543 2013-09-24 21:55:43  1380059743
    1       3  1375703582 2013-08-05 14:53:02  1375714382
    2      34  1374479873 2013-07-22 07:57:53  1374479873
    3       3  1374479892 2013-07-22 10:58:12  1374490692

Check:

    In [30]: data.eval("Offset = TS - TimeStamp")
    Out[30]:
       geo_id   TimeStamp                date          TS  Offset
    0       1  1380052543 2013-09-24 21:55:43  1380059743    7200
    1       3  1375703582 2013-08-05 14:53:02  1375714382   10800
    2      34  1374479873 2013-07-22 07:57:53  1374479873       0
    3       3  1374479892 2013-07-22 10:58:12  1374490692   10800
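As an additional sanity check (a sketch, using the geo frame from the question), the per-row offset implied by TS can also be compared against a direct TZ_OFFSET lookup:

    # the offset implied by TS should equal the TZ_OFFSET looked up from geo by geo_id
    expected = data['geo_id'].map(geo.set_index('id')['TZ_OFFSET'])
    assert ((data['TS'] - data['TimeStamp']) == expected).all()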

UPDATE: if in the output DF you want the date/time with the offset in UNIX epoch / timestamp format, then it is much easier to simply add TZ_OFFSET (the offset in seconds) to TimeStamp (the number of seconds elapsed since 1970-01-01 00:00:00 UTC):

    In [40]: data = (data.merge(geo, left_on='geo_id', right_on='id')
                         .eval("TS = TimeStamp + TZ_OFFSET"))

    In [41]: data
    Out[41]:
       geo_id   TimeStamp  id  TZ_OFFSET          TS
    0       1  1380052543   1       7200  1380059743
    1       3  1375703582   3      10800  1375714382
    2       3  1374479892   3      10800  1374490692
    3      34  1374479873  34          0  1374479873
  • Thank you, I'll look into it! :) - Marija
  • That is what I wanted to do with merge myself :) - Marija
  • I checked both options, with merge and with map, and the second one worked. But with merge, some geo_id values ended up paired with values they should not have been paired with. This result is related to how merge works: it compares values in pairs rather than matching key-value pairs. - Marija
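For reference, a hedged sketch of a merge variant that keeps the original row order of data (note how the rows are rearranged in Out[41] above): how='left' preserves the order of the left frame.

    # left join keeps every row of `data` in its original order,
    # then TS is computed the same way as in the answer
    data = (data.merge(geo, how='left', left_on='geo_id', right_on='id')
                .eval("TS = TimeStamp + TZ_OFFSET"))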