Best way to query this information in ClickHouse, JOIN vs duplicated information

1.1k views Asked by At

I have the following issue.

I need to record a lot of information than can be group and filter in a lot of ways. That information happens to have steps so, lets imagine it has 4 steps, in which each step add more information to the previous step.

Currently I have 4 tables, one for each step with all the information of each step and the previous one.

So an example would be.

Step1 (transaction_id, date, country_id, device_id, browser_id, ip, language_id, target_id, etc) -> 40kk records

Step2 (transaction_id, date, country_id, device_id, browser_id, ip, language_id, target_id, step2_date, step2_ip, etc) -> 35kk records

Step3 (transaction_id, date, country_id, device_id, browser_id, ip, language_id, target_id, step2_date, step2_ip, step3_date, step3_ip, time_taken, etc) -> 5kk records

Step4 (transaction_id, date, country_id, device_id, browser_id, ip, language_id, target_id, step2_date, step2_ip, step3_date, step3_ip, time_taken, step4_date, step4_ip, final_value, etc) -> 1kk records

As you can see in the example each step has all the previous data plus new fields.

The ideal would be to have the first step with all the initial data (which is currently about 50 fields) and the other steps with the additional data of each step, and then JOIN by the transaction_id and be able to GROUP and FILTER.

I tried this approach, but as it has to iterate for every entry to match the transaction in the other steps it didn't seem too optimal. (it took about 3 seconds in querying as the other way took about 0,1 secs)

In the other hand, having all the available information in each table, allows me to filter the data previous to the UNION or JOIN and it's quite faster, but I requires to duplicate the data in each step..

I need to query one SELECT from the 4 steps that will tell me for example for a specific date and device_id and group by country_id how many step1, step2, step3, step4 are, and other flags that each step may have.

So, the question would be if there is any other way to speed up the relationship among steps so to avoid the duplication of information in each and mantain the speed? Maybe a way to indexing the transaction_id (that is unique and generated in the first step)

Lets consider that step1 table will receive about 100kk records per day, and the other one as much information.

Thank you.

1

There are 1 answers

0
Maciej Łopaciński On

I would keep all the data in a single table and then use AggregatingMergeTree/CollapsingMergeTree engines and/or aggregation functions like argMin/argMax or sequenceMatch.

I would avoid JOINs as they cost a lot, duplication seems to be best solution for OLAP case.