Colleagues, please help me build a DataFrame column based on a given condition.

The source DataFrame:

     ID №Policy  Request      Request date  Decision
    123    23ff    10000  2018-01-28 11:36         0
    123    23ff    10000  2018-01-29 10:00      5000
    123    42rd    25000  2018-06-18 15:10     25000
    123    42rd    30000  2018-08-18 18:00     30000
    345    23ff    15000  2018-01-28 12:00     10000
    345    27fg    50000  2018-09-30 17:35         0
    345    81er    30000  2018-09-30 10:15     10000
    345    81er    30000  2018-10-20 11:30     10000
    678    12rt    55000  2018-12-01 09:25         0
    678    12rt    55000  2018-12-15 12:00     45000

I need to count the number of decisions (Decision) made for each ID within each №Policy, with the following restriction: if several decisions were made for the same ID and the same №Policy within one month, they count as one decision. In other words, one ID within one №Policy may have 2, 3 or more decisions, but if they all fall within one month, they must be counted as 1 decision regardless of the number of requests.

The result should look approximately like this:

     ID №Policy  Request      Request date  Decision  count
    123    23ff    10000  2018-01-28 11:36         0      0
    123    23ff    10000  2018-01-29 10:00      5000      1
    123    42rd    25000  2018-06-18 15:10     25000      1
    123    42rd    30000  2018-08-18 18:00     30000      1
    345    23ff    15000  2018-01-28 12:00     10000      1
    345    27fg    50000  2018-09-30 17:35         0      1
    345    81er    30000  2018-09-30 10:15     10000      0
    345    81er    30000  2018-10-20 11:30     10000      1
    678    12rt    55000  2018-12-01 09:25         0      0
    678    12rt    55000  2018-12-15 12:00     45000      1

I can't figure out what algorithm to use here :(

  • Can you explain why count is 0 in the first row of the resulting DF, but 1 in the sixth? - MaxU
  • The first row is 0 because a repeat decision was made one day later, with the same №Policy and ID. In the sixth row the loan was not approved (0), but a decision was still made, so it counts as 1 decision. The essence is this: if a credit decision for the same client under the same contract (№Policy) was made several times within 30 days, it should be counted as 1 decision. - Pavel
  • It would be much easier to aggregate the rows so that you end up with one row per ID, NPolicy, Request_month. - MaxU
  • Do you mean removing duplicates within the same period? Something like drop_duplicates? - Pavel
  • In the current formulation of the problem this is difficult to implement, because the logic for calculating count is not uniform. If count always started at 1, with 0 in all subsequent rows for the same ID and NPolicy within the same month, the logic would be uniform and easier to implement. - MaxU
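
One way to read Pavel's 30-day rule from the comments above is "a request counts as a decision unless another request for the same ID and policy follows within 30 days". A minimal sketch of that reading, assuming the columns are renamed to NPolicy and Request_date (as in MaxU's comment) and using `window`, a name introduced here, for the 30-day limit:

```python
import pandas as pd

# Sample data from the question; NPolicy and Request_date are assumed column names.
df = pd.DataFrame({
    "ID": [123, 123, 123, 123, 345, 345, 345, 345, 678, 678],
    "NPolicy": ["23ff", "23ff", "42rd", "42rd", "23ff",
                "27fg", "81er", "81er", "12rt", "12rt"],
    "Request": [10000, 10000, 25000, 30000, 15000,
                50000, 30000, 30000, 55000, 55000],
    "Request_date": pd.to_datetime([
        "2018-01-28 11:36", "2018-01-29 10:00", "2018-06-18 15:10",
        "2018-08-18 18:00", "2018-01-28 12:00", "2018-09-30 17:35",
        "2018-09-30 10:15", "2018-10-20 11:30", "2018-12-01 09:25",
        "2018-12-15 12:00",
    ]),
    "Decision": [0, 5000, 25000, 30000, 10000, 0, 10000, 10000, 0, 45000],
})

window = pd.Timedelta(days=30)  # assumed meaning of "within one month"

# Sort so that "the next request" is well defined inside each (ID, NPolicy) pair.
df = df.sort_values(["ID", "NPolicy", "Request_date"])

# Date of the following request for the same client and policy (NaT for the last one).
next_date = df.groupby(["ID", "NPolicy"])["Request_date"].shift(-1)

# A row is superseded when a later request follows within the window;
# comparisons with NaT are False, so the last request is never superseded.
superseded = (next_date - df["Request_date"]) <= window
df["count"] = (~superseded).astype(int)

# Restore the original row order.
df = df.sort_index()
print(df)
```

On the sample data this gives 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, which matches the count column in the question. Note that requests are chained under this reading: three requests spaced 20 days apart would collapse into a single decision, which may or may not be what is wanted.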

1 answer

If I understand the question correctly:

    In [209]: df['count'] = (df.groupby(['ID', 'NPolicy', pd.Grouper(key='Request_date', freq='MS')])
         ...:                  ['Decision']
         ...:                  .cumcount().eq(0).astype('int'))

    In [210]: df
    Out[210]:
        ID NPolicy  Request        Request_date  Decision  count
    0  123    23ff    10000 2018-01-28 11:36:00         0      1
    1  123    23ff    10000 2018-01-29 10:00:00      5000      0
    2  123    42rd    25000 2018-06-18 15:10:00     25000      1
    3  123    42rd    30000 2018-08-18 18:00:00     30000      1
    4  345    23ff    15000 2018-01-28 12:00:00     10000      1
    5  345    27fg    50000 2018-09-30 17:35:00         0      1
    6  345    81er    30000 2018-09-30 10:15:00     10000      1
    7  345    81er    30000 2018-10-20 11:30:00     10000      1
    8  678    12rt    55000 2018-12-01 09:25:00         0      1
    9  678    12rt    55000 2018-12-15 12:00:00     45000      0

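
The snippet above assumes that the columns have already been renamed to NPolicy and Request_date and that Request_date has a datetime dtype. A minimal reproducible sketch under those assumptions, which also sums the flags to get the number of decisions per ID and NPolicy (the name `decisions` is introduced here for illustration):

```python
import pandas as pd

# Sample data from the question; NPolicy and Request_date are assumed column names.
df = pd.DataFrame({
    "ID": [123, 123, 123, 123, 345, 345, 345, 345, 678, 678],
    "NPolicy": ["23ff", "23ff", "42rd", "42rd", "23ff",
                "27fg", "81er", "81er", "12rt", "12rt"],
    "Request": [10000, 10000, 25000, 30000, 15000,
                50000, 30000, 30000, 55000, 55000],
    "Request_date": pd.to_datetime([
        "2018-01-28 11:36", "2018-01-29 10:00", "2018-06-18 15:10",
        "2018-08-18 18:00", "2018-01-28 12:00", "2018-09-30 17:35",
        "2018-09-30 10:15", "2018-10-20 11:30", "2018-12-01 09:25",
        "2018-12-15 12:00",
    ]),
    "Decision": [0, 5000, 25000, 30000, 10000, 0, 10000, 10000, 0, 45000],
})

# Flag the first request in each (ID, NPolicy, calendar month) group with 1,
# and every later request in the same group with 0.
df["count"] = (
    df.groupby(["ID", "NPolicy", pd.Grouper(key="Request_date", freq="MS")])["Decision"]
      .cumcount()
      .eq(0)
      .astype("int")
)

# Number of decisions per client and policy = number of flagged rows.
decisions = df.groupby(["ID", "NPolicy"])["count"].sum()
print(decisions)
```

Because the grouping is by calendar month, the two 81er requests for ID 345 (2018-09-30 and 2018-10-20) are counted as two decisions even though they are only 20 days apart.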
  • Thanks, everything seems to be correct. One more question: if I want to change the period, say to 1.5 months or 20 days instead of one month, how can I do that? - Pavel
  • See the complete table of "offset aliases" in the pandas documentation; any of them can be used in the freq parameter. You can also specify a number of periods, for example freq='10D' or freq='2W'. If you can't figure it out, ask a new question here and we'll try to work it out together ;) - MaxU
  • Thank you so much! It really helped, as always :) - Pavel
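
To illustrate the freq suggestion above, the sketch below wraps the answer's expression in a helper (`flag_first_in_window` is a name made up here) and applies it with freq='20D'. As far as I understand pd.Grouper, the bins are fixed 20-day windows laid out from midnight of the earliest date in the column, not a rolling 20 days counted from each request, so treat this as an approximation of "within 20 days":

```python
import pandas as pd

# Sample data from the question; NPolicy and Request_date are assumed column names.
df = pd.DataFrame({
    "ID": [123, 123, 123, 123, 345, 345, 345, 345, 678, 678],
    "NPolicy": ["23ff", "23ff", "42rd", "42rd", "23ff",
                "27fg", "81er", "81er", "12rt", "12rt"],
    "Request": [10000, 10000, 25000, 30000, 15000,
                50000, 30000, 30000, 55000, 55000],
    "Request_date": pd.to_datetime([
        "2018-01-28 11:36", "2018-01-29 10:00", "2018-06-18 15:10",
        "2018-08-18 18:00", "2018-01-28 12:00", "2018-09-30 17:35",
        "2018-09-30 10:15", "2018-10-20 11:30", "2018-12-01 09:25",
        "2018-12-15 12:00",
    ]),
    "Decision": [0, 5000, 25000, 30000, 10000, 0, 10000, 10000, 0, 45000],
})


def flag_first_in_window(frame, freq):
    """Return 1 for the first row in each (ID, NPolicy, time bin) group, else 0."""
    groups = frame.groupby(["ID", "NPolicy", pd.Grouper(key="Request_date", freq=freq)])
    return groups.cumcount().eq(0).astype("int")


# Calendar-month bins, as in the answer.
df["count_month"] = flag_first_in_window(df, "MS")

# Fixed 20-day bins instead of calendar months.
df["count_20d"] = flag_first_in_window(df, "20D")

print(df[["ID", "NPolicy", "Request_date", "count_month", "count_20d"]])
```

On the sample data the two 12rt requests for ID 678 (2018-12-01 and 2018-12-15) fall into different 20-day bins and are both flagged, although they are only 14 days apart. For 1.5 months, a day count such as freq='45D' is one possible approximation.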