There is raw data of the history of using objects in the form of a list of sesst-triples (id объекта, время начала, время окончания)

 [(1, "2012-09-20 00:00:00+04", "2012-09-20 05:00:00+04"), (1, "2012-09-20 07:30:00+04", "2012-09-20 09:25:00+04"), (2, "2012-09-20 07:00:00+04", "2012-09-20 09:15:00+04")] 

That is, in this example, object 1 was used twice, from 00:00 to 05:00 and from 07:30 to 09:25, and object 2 - from 07:00 to 09:15. The list is now sorted by object ID and then by increasing time, but you can submit data in any form - in general, this is a table in the SQL RDBMS.

Records as a rule should not, but, theoretically, can intersect - that is, it may happen that it will be [(1, "…", "… 12:00:37"), (1, "… 11:59:42", "…")] , and in this case we can assume that the use in the area of ​​12 hours was not interrupted.

I want to receive from this list a time series, conditionally, of the following form:

 [("2012-09-20 00:00:00+04", {1}), ("2012-09-20 05:00:00+04", {}), ("2012-09-20 07:00:00+04", {2}), ("2012-09-20 07:30:00+04", {1,2}), ("2012-09-20 09:15:00+04", {1}), ("2012-09-20 09:25:00+04", {})] 

Those. from 00:00 (the minimum date-time in the history) object 1 was used, then at 05:00 nothing, then from 07:00 - object 2, then both objects, etc., until 09:25, for which data are over.

Please tell me a good algorithm and data structures for quickly performing such a conversion. Volumes - up to 10,000 objects, for periods of varying duration (day, week, month, 3 months, 6 months, year, no longer interesting), up to 100,000 entries per day.

CPU time pity, memory - any number, within physically reasonable limits.

The obvious “greedy” algorithm with a run over the time interval is no good - the algorithm should not depend directly on the duration of the interval in question, only on the number of records in the history.

  • judging by the number of records per day, there will be more than seconds in a day, the mileage over the time interval does not look so bad. - Yura Ivanov
  • This is at the peak, if suddenly the system began to "jerk" and she began to actively finish and immediately start the session again. The usual procedure is something between 5,000–10,000 records / day. - drdaeman
  • one
    It seems that the result can be achieved by request. We need to think about it. At least precisely it is possible to get rid of the intersection of intervals for one object. - Yura Ivanov

3 answers 3

  • In such a formulation, if I interpreted everything correctly, it sounds like a typical use case for a data structure called Interval Tree. The algorithm might look something like this:

http://i.imgur.com/i1dKa.png


  • Lax justification and estimation of the complexity of the algorithm:
  • It is clear that a change in the set of currently running objects can only occur at the start or end for some triple, so the UNIQUE approach should work.

  • The running time for n input elements will be O(n log n) + O(n) + O(n * (log n + k)) = O(n * (log n + k)) , where k is the number of elements with overlapping intervals, because the complexity of the insertion is O(log n), and the cost of the query is O(log n + k) .

  • Algorithmic complexity of operations for Interval Tree I spied here.

  • In due time used boost::icl for the similar purposes.
  • Yes, it looks like (I understand that with metadata is it usually called interval maps?) Just what you need. At least very similar. Thank! - drdaeman
  • @drdaeman Yes, but there are tricky rules for aggregating metadata. In general, the ICL clearly written by a man with an academic mindset :) - Costantino Rupert

I got this query:

 select l1.begindate a, group_concat(l2.obj), 1 from logs l1 join logs l2 on l2.begindate <= l1.begindate and l2.enddate>=l1.begindate group by a union all select l1.enddate a, group_concat(l2.obj), 2 from logs l1 left join logs l2 on l2.begindate<=l1.enddate and l2.enddate>l1.enddate group by a order by a; 

I did not check for intersection of intervals ...

  • one
    Some problem is the use of a specific GROUP_CONCAT - renegator
  • it is not specific, in many subd there is either an analogue, or similar functionality can be achieved through the procedures ... go and see. - Yura Ivanov

Well, if it is stated that this is a SQL database, then it will just deal with it. The list of objects used at each time point is replaced by their number. If necessary, they are also easy to get.

 (SELECT t1.start , count(*) FROM table t1 JOIN table t2 ON t2.start <= t1.start AND t2.finish > t1.start GROUP BY t1.start ) UNION (SELECT t3.finish , count(*) FROM table t3 JOIN table t4 ON t4.start <= t3.finish AND t4.finish > t3.finish GROUP BY t3.finish) UNION (SELECT t5.finish , 0 FROM table t5 WHERE NOT EXISTS ( SELECT * FROM table t6 WHERE t5.finish > t6.start AND t5.finish < t6.finish ) ) ORDER BY 1 
  • +1 fine. - Yura Ivanov