Good day.

I decided to use ClickHouse for analytics. I record about 30,000,000 events per month. There are several event types, and they all share a common subset of attributes, so I created one main table plus an additional table for each event type with its own unique set of columns. But the JOINs turned out to be too expensive.

Now I am considering two schemas:

  • One big table containing ALL columns of ALL event types
  • A separate table for each event type, each with its full set of columns

For example:

  • the "visited page" event has columns: date of visit, URL, referrer, user ID, session ID
  • the "added to basket" event has columns: date and time, product ID, quantity of goods, page URL, user ID, session ID
  • the "removed from basket" event has columns: date and time, product ID, page URL, user ID, session ID

Which one is better? Can you advise?

  • 1
    Start from the analytics you need. If you need general analytics and unexpected ad-hoc queries such as "how many add-to-basket events does a user make", go with the single general stream, because you cannot plan every detail in advance. If you know for certain that you need a specific detailed analysis, create a separate table for that case, but keep the general stream as well. - etki
  • 2
    By the way, there is a dedicated Telegram chat where you can talk directly with the developers and early adopters: telegram.me/clickhouse_ru - etki

1 answer

Letter from Yandex

Good day.

In our experience, the best option is one wide table with columns for all event types. The sparseness of the table is not a problem: neither the amount of data on disk nor performance suffers from it.

Queries that combine different event types (for example, calculating the conversion from visits to basket additions) become much simpler. No JOIN is needed, just GROUP BY with conditional aggregate functions of the form sumIf(..., event_type = 1).
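
To make the idea concrete, here is a minimal sketch of conditional aggregation over one wide table. It runs on Python's built-in sqlite3 only so that it can be tried anywhere; SQLite has no sumIf, so SUM(CASE WHEN ...) stands in for it. The table name, column names, event type codes, and sample rows are all assumptions made for illustration and are not taken from the original question or from Yandex's setup.

```python
import sqlite3

# Hypothetical event type codes; the answer only mentions `event_type = 1`.
VISIT, ADD_TO_BASKET, REMOVE_FROM_BASKET = 1, 2, 3

conn = sqlite3.connect(":memory:")

# One wide table: columns that do not apply to an event type are left NULL.
conn.execute("""
    CREATE TABLE events (
        event_time TEXT, event_type INTEGER, user_id INTEGER, session_id INTEGER,
        url TEXT, referrer TEXT, product_id INTEGER, quantity INTEGER
    )
""")

# Made-up sample rows covering all three event types.
rows = [
    ("2017-01-01 10:00:00", VISIT, 1, 10, "/item/5", "https://example.com", None, None),
    ("2017-01-01 10:01:00", ADD_TO_BASKET, 1, 10, "/item/5", None, 5, 2),
    ("2017-01-01 10:05:00", VISIT, 2, 20, "/item/5", None, None, None),
    ("2017-01-01 10:06:00", VISIT, 3, 30, "/item/7", None, None, None),
    ("2017-01-01 10:07:00", ADD_TO_BASKET, 3, 30, "/item/7", None, 7, 1),
    ("2017-01-01 10:08:00", REMOVE_FROM_BASKET, 3, 30, "/item/7", None, 7, None),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Visits and basket additions per URL in a single pass, with no JOIN.
# SUM(CASE WHEN ...) plays the role that sumIf/countIf would play in ClickHouse.
per_url = conn.execute("""
    SELECT url,
           SUM(CASE WHEN event_type = ? THEN 1 ELSE 0 END) AS visits,
           SUM(CASE WHEN event_type = ? THEN 1 ELSE 0 END) AS adds
    FROM events
    GROUP BY url
    ORDER BY url
""", (VISIT, ADD_TO_BASKET)).fetchall()

for url, visits, adds in per_url:
    print(url, visits, adds, round(adds / visits, 2))
```

In ClickHouse itself the two SUM(CASE ...) expressions would likely be written as countIf(event_type = 1) and countIf(event_type = 2); the shape of the query stays the same.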

Separate tables make sense when the tables need different keys. Also, if one event type is much smaller than the others but is queried more often, and speed matters for those queries, such events can be split out into a separate table. In that case you can still write these events to the single combined table as well.
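
The "write to both" variant from the last paragraph could look roughly like the sketch below. Again sqlite3 is only a stand-in so the example is runnable; record_event, basket_adds, the column set, and the choice of "added to basket" as the small, frequently queried event type are all hypothetical.

```python
import sqlite3

# Hypothetical code for the small, frequently queried event type.
ADD_TO_BASKET = 2

conn = sqlite3.connect(":memory:")
# Combined table that receives every event.
conn.execute(
    "CREATE TABLE events (event_time TEXT, event_type INTEGER, user_id INTEGER, product_id INTEGER)"
)
# Small dedicated table for the frequently queried event type only.
conn.execute(
    "CREATE TABLE basket_adds (event_time TEXT, user_id INTEGER, product_id INTEGER)"
)


def record_event(event_time, event_type, user_id, product_id=None):
    """Write every event to the combined table and copy the hot type to its own table."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (event_time, event_type, user_id, product_id),
    )
    if event_type == ADD_TO_BASKET:
        conn.execute(
            "INSERT INTO basket_adds VALUES (?, ?, ?)",
            (event_time, user_id, product_id),
        )


record_event("2017-01-01 10:00:00", 1, 1)
record_event("2017-01-01 10:01:00", ADD_TO_BASKET, 1, product_id=5)
record_event("2017-01-01 10:02:00", 1, 2)

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
hot = conn.execute("SELECT COUNT(*) FROM basket_adds").fetchone()[0]
print(total, hot)
```

Cross-event queries keep using the combined table, while the frequent, speed-sensitive queries read only the small one.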