I recently started learning Python and am now puzzling over one task.
There are two dataframes:

    user  date/time
    a     01.02.2018
    a     01.03.2018
    a     15.03.2018
    b     01.02.2018
    b     02.02.2018

and the second table:

    user  date/time
    a     01.01.2018
    a     02.01.2018
    a     02.02.2018
    a     01.03.2018
    a     14.03.2018
    b     01.01.2018
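To experiment with the task, the two tables can be rebuilt as dataframes. This is only a minimal sketch: the names `d1` and `d2` are borrowed from the answer's code, and the dates are assumed to be written day-first (dd.mm.yyyy), which is consistent with the counts shown in the answer's result.

```python
import pandas as pd

# First table: the per-user boundary dates.
d1 = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "date/time": ["01.02.2018", "01.03.2018", "15.03.2018",
                  "01.02.2018", "02.02.2018"],
})

# Second table: the rows to be counted.
d2 = pd.DataFrame({
    "user": ["a", "a", "a", "a", "a", "b"],
    "date/time": ["01.01.2018", "02.01.2018", "02.02.2018",
                  "01.03.2018", "14.03.2018", "01.01.2018"],
})

# Assumption: day-first dates, so the format is spelled out explicitly.
for df in (d1, d2):
    df["date/time"] = pd.to_datetime(df["date/time"], format="%d.%m.%Y")
```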

Is it possible to calculate, without using loops, how many rows of the second table fall into each time range that the first table defines for a user? That is, for user "a", for example, calculate:

  1. how many rows of the second table are earlier than 01.02.2018,
  2. how many lie between 01.02.2018 and 01.03.2018,
  3. how many lie between 01.03.2018 and 15.03.2018,
  4. how many are later than 15.03.2018.

In the end, I want to get something like this:

    user  date/time          count_in_table2
    a     before 01.02.2018  2
    a     01.02.2018         1
    a     01.03.2018         2
    a     after 15.03.2018   0
    b     before 01.02.2018  1
    b     01.02.2018         0
    b     after 02.02.2018   0
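The counts for user "a" in the table above imply that a row falling exactly on a boundary date belongs to the interval that starts on that date (01.03.2018 is counted in the "01.03.2018" row, not in the "01.02.2018" row). A small sketch of that counting rule for a single user, using `numpy.searchsorted`; the variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Boundary dates of user "a" from the first table (assumed day-first).
edges = pd.to_datetime(["01.02.2018", "01.03.2018", "15.03.2018"],
                       format="%d.%m.%Y")
# Rows of user "a" from the second table.
rows = pd.to_datetime(
    ["01.01.2018", "02.01.2018", "02.02.2018", "01.03.2018", "14.03.2018"],
    format="%d.%m.%Y",
)

# side="right" sends a row that equals a boundary to the interval starting there.
slot = np.searchsorted(edges.values, rows.values, side="right")

# One counter per interval: before the first edge, between edges, after the last.
counts = np.bincount(slot, minlength=len(edges) + 1)
print(counts.tolist())  # [2, 1, 2, 0]
```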

The only thing I came up with was loops with a bunch of branches and conditions, but I understand that this is not a real solution. I want to learn how to use the magic of pandas.

    1 answer

    Try this:

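        import pandas as pd

        # Assumes d1 and d2 are the two tables, with 'date/time' already parsed as datetime64.

        def my_cut(data, dates, **kwargs):
            """Bin `data` by the sorted boundary `dates`, labelling each bin by its right edge."""
            assert isinstance(data, pd.Series)
            assert isinstance(dates, pd.Series)
            dates = dates.sort_values()
            # Sentinel edges far in the past and future catch rows outside the user's dates.
            bins = pd.to_datetime([pd.to_datetime('1900-01-01')]
                                  + dates.tolist()
                                  + [pd.to_datetime('2200-01-01')])
            dts = dates.dt.strftime('%Y-%m-%d').tolist()
            # The final bin lies after the latest date.
            labels = ['<= ' + d for d in dts] + ['after ' + dts[-1]]
            return pd.cut(data, bins=bins, labels=labels, duplicates='drop', **kwargs)

        # Label every row of d2 with the bin built from the same user's dates in d1.
        # group_keys=False keeps the original row index, so the labels align with d2.
        bin_labels = (d2.groupby('user', group_keys=False)['date/time']
                        .apply(lambda g: my_cut(g, dates=d1.loc[d1.user == g.name, 'date/time'])))

        # size() returns a Series here, so reset_index(name=...) can name the count column.
        (d2.groupby(['user', bin_labels])
           .size()
           .reset_index(name='count_in_table2'))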

    Result:

          user      date/time  count_in_table2
        0    a  <= 2018-02-01                2
        1    a  <= 2018-03-01                2
        2    a  <= 2018-03-15                1
        3    b  <= 2018-02-01                1
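Note that this result puts a boundary date into the interval that ends on it (`pd.cut` closes bins on the right by default) and leaves out empty intervals, whereas the table in the question counts a boundary date in the interval that starts on it and also lists zero counts. A possible variant is sketched below, under the assumption that the dates are day-first. The helper `count_per_interval` is hypothetical, and it still iterates over users (not over rows), so it is not entirely loop-free.

```python
import pandas as pd

def to_dt(values):
    # Assumption: dates are written day-first (dd.mm.yyyy).
    return pd.to_datetime(values, format="%d.%m.%Y")

d1 = pd.DataFrame({"user": list("aaabb"),
                   "date/time": to_dt(["01.02.2018", "01.03.2018", "15.03.2018",
                                       "01.02.2018", "02.02.2018"])})
d2 = pd.DataFrame({"user": list("aaaaab"),
                   "date/time": to_dt(["01.01.2018", "02.01.2018", "02.02.2018",
                                       "01.03.2018", "14.03.2018", "01.01.2018"])})

def count_per_interval(user, rows):
    """Hypothetical helper: count one user's d2 rows per interval defined by d1."""
    dates = d1.loc[d1["user"] == user, "date/time"].sort_values()
    text = dates.dt.strftime("%d.%m.%Y").tolist()
    # Sentinel edges catch rows before the first and after the last boundary.
    bins = pd.to_datetime([pd.Timestamp("1900-01-01")] + dates.tolist()
                          + [pd.Timestamp("2200-01-01")])
    labels = ["before " + text[0]] + text[:-1] + ["after " + text[-1]]
    # right=False: each interval includes its left boundary and excludes the right one.
    cut = pd.cut(rows, bins=bins, labels=labels, right=False)
    # Counting a categorical keeps empty categories, so zero counts are reported too.
    return cut.value_counts(sort=False)

frames = []
for user, rows in d2.groupby("user")["date/time"]:
    counts = count_per_interval(user, rows)
    frames.append(pd.DataFrame({
        "user": user,
        "date/time": counts.index.astype(str),
        "count_in_table2": counts.to_numpy(),
    }))
result = pd.concat(frames, ignore_index=True)
print(result)
```

Users that appear only in `d1` and have no rows in `d2` are not listed by this sketch; handling them would need an extra step.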