Hello. Can you please explain how to write a SQL query that finds all sessions during which the user performed the following actions, possibly not consecutively, but in the given order:

  1. the user went to rooms.homework-showcase;
  2. the user went to rooms.view.step.content;
  3. the user went to rooms.lesson.rev.step.content.

A session is a period of user activity in which less than one hour passes between successive actions. The session begins at the time of the first of these actions and ends an hour after the last of them.
The result should be a list of sessions of the form: user_id, <session start date>, <session end date>.
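To make the session rule concrete, here is a small Python sketch of the grouping (the sample timestamps and the `sessions` function are invented for illustration, not taken from the real data):

```python
from datetime import datetime, timedelta

# Hypothetical sample data: (user_id, happened_at) pairs, sorted by time.
events = [
    (1, datetime(2019, 1, 1, 10, 0)),
    (1, datetime(2019, 1, 1, 10, 30)),
    (1, datetime(2019, 1, 1, 12, 30)),  # gap of 2 h starts a new session
]

def sessions(rows, gap=timedelta(hours=1)):
    """Split one stream of sorted events into sessions.

    A new session starts whenever the gap since the previous action
    of the same user is one hour or more; a session ends one hour
    after its last action.
    """
    out = []  # (user_id, session start, last action so far)
    for user_id, ts in rows:
        if out and out[-1][0] == user_id and ts - out[-1][2] < gap:
            out[-1] = (user_id, out[-1][1], ts)  # extend current session
        else:
            out.append((user_id, ts, ts))        # open a new session
    # session end = last action + 1 hour
    return [(u, start, last + gap) for u, start, last in out]

for s in sessions(events):
    print(s)
```

With the sample above this yields two sessions: 10:00-11:30 and 12:30-13:30.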

There is data of the following form:

user_id - happened_at - page,
where happened_at is the timestamp of the action and page is the type of the action (only rooms.homework-showcase and the other two pages are needed).

My query currently looks like this:

 SELECT *,
        (case page
             when 'rooms.homework-showcase' then 1
             when 'rooms.view.step.content' then 2
             when 'rooms.lesson.rev.step.content' then 3
             else 4
         end) activity_number,
        extract('epoch' from happened_at)
          - extract('epoch' from lag(happened_at) OVER (PARTITION BY user_id ORDER BY happened_at)) time_lag
 FROM test.vimbox_pages
 WHERE page IN ('rooms.homework-showcase', 'rooms.view.step.content', 'rooms.lesson.rev.step.content')
 ORDER BY happened_at

In it I select only the relevant actions and, using the lag window function, compute the time difference between each action and the previous one.

Right now I post-process the results of this query in Python. I searched a lot on the Internet for a pure-SQL approach, but everything I found covers only the case where records are bucketed by calendar day (or hour) and then handled with window functions or self-joins. That approach will not work here: there can be records at minute 1 and minute 119. The difference in hour buckets does not exceed 1, but this is certainly not what I need, because measured in minutes the gap is more than 60.
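The minute 1 / minute 119 pitfall is easy to demonstrate in a couple of lines (timestamps invented for illustration):

```python
from datetime import datetime

# Two actions at minute 1 and minute 119 of the same day.
a = datetime(2019, 1, 1, 0, 1)
b = datetime(2019, 1, 1, 1, 59)

# Bucketing by hour says they are only 1 hour apart...
hour_diff = b.hour - a.hour

# ...but the real gap is 118 minutes, i.e. well over an hour.
real_gap_min = (b - a).total_seconds() / 60

print(hour_diff, real_gap_min)  # 1 118.0
```

This is why the session break has to be detected from the actual difference of timestamps (as the lag-based query does), not from hour truncation.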

If it's not too much trouble, please share a snippet right away, or point me to something to read :)
P.S. I use Redshift SQL, which is almost the same as PostgreSQL

  • The main idea is to pull the date from the previous record into the current record, in one row. But exactly how to do this cannot be said until you tell us which DBMS you use. The approaches differ fundamentally: in MySQL, variables are used, while most other databases use functions like lag(), or, if those are unavailable, subqueries that select the nearest date - Mike
  • If you want an adequate answer, specify the database and provide the input data not as a picture but as text, so that a test table can be created from it quickly, along with what exactly should come out of this data; it is not clear from the current question what you want to do with these groups - Mike
  • @Mike I have supplemented the question with the initial data and the desired result. I use Redshift SQL, which is almost the same as PostgreSQL - Zhora Grigoriev
  • you most likely removed the data :) but I think I am beginning to understand... though not completely. List all the signs of the end of a session: the first is clear, more than an hour has passed. But I did not quite understand the event numbers (1, 2, 3): must they all occur in the session, and must they be strictly consecutive in time (i.e., if they occur in the order 2, 1, 3, does it not count as a session)? And what if something is skipped, i.e. only events 1 and 3 occurred? - Mike
  • @Mike the actions can be dealt with later, but in general: if there is a monotonic subsequence such as 1 1 2 2 3 3 3, then the session qualifies, and if, for example, all the activities are 2 1 1 2 3 1, i.e. non-monotonic, then the session does not pass - Jora Grigoriev
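One literal reading of the rule in the last comment (the session counts only if its activity numbers never decrease and all three events occur) can be sketched like this; the `passes` function and both sequences come straight from the comment's examples, everything else is an assumption:

```python
def passes(activities):
    """One literal reading of the rule above: the session qualifies
    if its activity numbers form a non-decreasing sequence and all
    three event types (1, 2, 3) occur in it."""
    non_decreasing = all(a <= b for a, b in zip(activities, activities[1:]))
    return non_decreasing and {1, 2, 3} <= set(activities)

print(passes([1, 1, 2, 2, 3, 3, 3]))  # True  - monotonic, all 3 events
print(passes([2, 1, 1, 2, 3, 1]))     # False - non-monotonic
```

Both examples from the comment check out under this reading, but note that the accepted answer below uses a slightly looser criterion (any non-decreasing run containing all three events), which may explain the disagreement in the final comment.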

1 answer

 select user_id, min(min_dt) min_dt, max(max_dt) max_dt, grp
 from (
     select user_id, grp, act_grp,
            min(happened_at) min_dt, max(happened_at) max_dt,
            count(distinct activity_number) act_cnt
     from (
         select *,
                sum(grp_start) over(partition by user_id order by happened_at) grp,
                sum(act_start) over(partition by user_id order by happened_at) act_grp
         from (
             select *,
                    case when coalesce(extract('epoch' from happened_at - lag(happened_at) over(partition by user_id order by happened_at)), 0) < 3600
                         then 0 else 1 end grp_start,
                    case when lag(activity_number) over(partition by user_id order by happened_at) > activity_number
                         then 1 else 0 end act_start
             from pages
         ) X
     ) Y
     group by user_id, grp, act_grp
 ) Z
 group by user_id, grp
 having max(act_cnt) = 3

An example on sqlfiddle.com (in the absence of test data I had to create some from another table, which is why only one suitable interval turned up, and with 10-hour sessions at that).

The innermost query ( X ), in addition to the columns of the source table, produces flags marking the start of a new session ( grp_start ) on rows whose date differs too much from the previous one, and flags marking the start of a new group of non-decreasing event numbers ( act_start ). You need either to replace pages in the query with your own subquery that derives the event numbers from the page names, or to expand activity_number into your CASE expression in the several places it appears. At the next level ( Y ), running sums of the flags number the session groups and the increasing-event groups. One level up ( Z ), records are grouped into increasing-event groups and the number of distinct events within each group is counted. Finally, at the outermost level, the intervals are collected into sessions and only those sessions are kept that contain an increasing-event group with all 3 event types.

  • Thanks! It worked, although in Redshift SQL, when aggregate functions are used as window functions, I had to extend the call to sum(grp_start) over(partition by user_id order by happened_at rows between 1 following and 1 following) grp - Jora Grigoriev
  • @ZhoraGrigorev But that is strange: that rows clause should take only the single following row, which does not fit with numbering the groups from the beginning of the sample. If such a clause is needed, it should be rows between unbounded preceding and current row - Mike
  • The problem, by the way, is solved incorrectly; it produces the wrong set of resulting values.... - Mishustiq