The project required writing several simple time-series data streams into a database. The data goes into a simple table:

    type        | id         | taken_at            | value
    click_count | region:spb | 2016-01-01 10:00:00 | 12
    view_count  | *          | 2016-01-01 10:00:02 | {"anonymous": 12, "logged": 3}

Naturally, we need to generate reports over this table, including downsampled ones: the user can specify that they need values at a certain interval, say one day, thirty minutes, or two hours. In pseudocode, the selection looks like this:

    results = ts.query()
    pointer = 0
    filtered = []
    last_date = 0
    interval = 3600 // 1 hour
    while pointer < results.length:
        if last_date + interval < results[pointer].taken_at:
            filtered[] = results[pointer]
            last_date = results[pointer].taken_at
        pointer++
    return filtered
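For reference, the same loop as a small runnable Python sketch (the sample data and field names here are made up for illustration):

```python
# Keep one row, then skip rows that fall within `interval` seconds of the
# last kept row -- the same logic as the pseudocode above.
def downsample(results, interval):
    filtered = []
    last_date = 0
    for row in results:
        if last_date + interval < row["taken_at"]:
            filtered.append(row)
            last_date = row["taken_at"]
    return filtered

samples = [{"taken_at": t} for t in (10000, 10060, 13700, 13800, 17500)]
print([r["taken_at"] for r in downsample(samples, 3600)])  # [10000, 13700, 17500]
```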

Is it possible to implement such a selection (include a row; skip rows while they fall within the previous value's date plus the user-specified interval; repeat while records remain) at the SQL level? The specific engine, if it matters, is MySQL, though ideally I would like a solution that does not depend on the engine.

  • And what would "sampling" be in Russian? – edem
  • Well, if iterating at a fixed interval is acceptable, you can convert the date to unix_time (a number of seconds) and take records with specific remainders modulo 3600, for example – Mike
  • If a fixed remainder doesn't work for you, then it will definitely depend on the engine. MySQL stands apart here: it has no window functions or CTEs, but it does have variables, with which you can mark rows by arbitrary logic and filter them in an outer query. In other databases a recursive CTE might help – Mike
  • @edem I find it hard to translate correctly (otherwise I would have written it right away). I mean taking individual values from the overall stream under certain conditions (in this case, at an interval). --- disregard this, translated! – etki
  • @Etki common table expression. Its recursive version is the interesting part: the first part should select the first record, the second part should look for the next one; in effect, that is exactly this loop. And by the way, nothing prevents you from, say, rounding timestamps down to an hour (or the desired size) and taking one record per rounded range with the same remainder of the division, or using window functions to number the rows and take every N-th – Mike
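Mike's remainder-of-division idea from the comments can be sketched in Python; the bucket size and tolerance here are illustrative assumptions, and the approach only works if each bucket actually contains a row within the tolerance window:

```python
# Take roughly one record per hour by keeping only rows whose unix timestamp
# falls into the first `tolerance` seconds of each hour-sized bucket.
def pick_by_remainder(timestamps, bucket=3600, tolerance=60):
    return [t for t in timestamps if t % bucket < tolerance]

print(pick_by_remainder([0, 30, 3599, 3600, 7300]))  # [0, 30, 3600]
```

Note that 7300 is dropped: it is 100 seconds past its bucket boundary, outside the 60-second tolerance, which is exactly why a fixed remainder only suits regularly arriving data.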

1 answer

We went with the suggestion from @Mike plus a few articles found through Google and built round-robin storage. Of course, it would have been better to use a plain RRD from the start, but we are constrained in our choice of storage.

Each metric in the problem is characterized by four parameters:

  • Type (a set of metrics with the same structure, built from samples with different filtering)
  • Identifier (within a type)
  • Metric data retention period (converted to seconds)
  • Interval at which metric values are taken (likewise converted to seconds)

Dividing the retention period (hereafter range) by the interval (hereafter interval) tells you how many metric values will be stored within one range. Then, with simple timestamp arithmetic, the entire timeline can be split into segments corresponding to ranges:

    range = 86400

    timestamp | range #
    0         | 0
    86400     | 1
    172800    | 2
    ...

The time within such a window is computed with the usual modulo:

 range_timestamp = timestamp() % range 

After that, the sequence number of a measurement within each range can be determined:

 sample_sequence_number = round(range_timestamp / interval) 

The computed sequence number is then made part of the primary key, which limits storage to (range / interval) records per metric without the need to build a garbage collector: new measurements simply overwrite old rows with the same sequence number.
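Put together, the arithmetic above looks like this (a sketch; `range` is renamed `retention` to avoid shadowing the Python builtin, and the defaults are illustrative):

```python
def sequence_number(timestamp, retention=86400, interval=300):
    # offset of the sample inside its retention window
    range_timestamp = timestamp % retention
    # ordinal of the measurement within the window; the next window reuses
    # the same numbers, so old rows are overwritten round-robin style
    return round(range_timestamp / interval)

print(sequence_number(86700))  # second slot of the second day -> 1
```

With a one-day retention and five-minute interval, timestamps 300 and 86700 map to the same sequence number, which is what makes the second day's row replace the first day's.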

The table has acquired the following form:

    type        | id         | sequence_number | taken_at            | value
    click_count | region:spb | 0               | 2016-01-01 10:03:00 | 12
    click_count | region:spb | 1               | 2016-01-01 10:01:00 | 5
    click_count | region:spb | 2               | 2016-01-01 10:02:00 | 3

Besides the nice bonus of free garbage disposal, this scheme makes it easy to perform the selections described above with a period rounded to a multiple of the interval. For example, if the metric is sampled every five minutes and the user requests a resolution of 27 minutes, the engine performs the following operations:

  • Round the resolution to the nearest whole number of intervals ( sampling = max(1, round(input / interval)) )
  • Execute the following query:

     SELECT *
     FROM statistics
     WHERE type = 'click_count'
       AND id = 'region:spb'
       AND sequence_number % sampling = 0
       AND taken_at >= '2016-01-01 10:00'
       AND taken_at <= '2016-01-01 12:00'
     ORDER BY taken_at ASC
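The rounding step can be checked numerically for the 27-minutes-against-5-minutes example (the function name is ours, not part of the original code):

```python
def sampling_factor(requested_seconds, interval_seconds):
    # round the requested resolution to the nearest whole number of intervals,
    # clamped to at least 1 so every row still qualifies at fine resolutions
    return max(1, round(requested_seconds / interval_seconds))

print(sampling_factor(27 * 60, 5 * 60))  # 27 min ~ 5 intervals -> every 5th row
```

Plugging 5 into the query's `sequence_number % sampling = 0` condition then returns every fifth stored measurement, i.e. one value per 25 minutes, the closest multiple of the interval to the requested 27.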

If necessary, the sampling offset can also be adjusted relative to the current date, but in our situation this was not needed.