Amount of clients which have seen an impression of a product before purchasing - most optimal answer for Billion-record table

30 views Asked by At

for a project I am working on I was asked to solve this task:

Given the table

events event_id int (autoincrement) --10B distinct values event_ts datetime -- 10B event_type int (1 = impression, 2 = click, 3 = purchase...) --20 types product_id int --100K client_id int --10M client_type int --10

Q: find the amount of clients which have seen an impression of a product before purchasing it

I came up with two solutions:

1)

With cteProdsClients As (
select e1.product_id ,e1.client_id 
from  events as e1
where event_type = ‘3’ and 

EXISTS (SELECT  e2.product_id ,e2.client_id 
              FROM events as e2
              WHERE event_type = ‘1’ and    e1.product_id  = e2.product_id 
                        AND e1.client_id  = e2.client_id
AND e1.event_ts  <e2.event_ts )

count(distinct product_id)
From cteProdsClients                    
)
SELECT
    count(distinct client_id)
FROM 
    cteProdsClients ;



2)

With cteProdsClients As (
select e1.product_id ,e1.client_id 
from  events as e1 left join 
(
  SELECT  e2.product_id ,e2.client_id 
              FROM events as e2   
 WHERE event_type = ‘1’  AND e1.event_ts  <e2.event_ts
)  
ON e1.product_id  = e2.product_id  AND e1.client_id  = e2.client_id
WHERE event_type = ‘3’ 
)
SELECT
    count(distinct client_id)
FROM 
    cteProdsClients ;

I need to not only create a query that would get the output but also do it in the most efficient, optimized way. Out of these 2 which is better? I would appreciate it if you fixed errors (if exist) and propose better solutions Thanks

1

There are 1 answers

5
NickW On

You seem to be overcomplicating things; I believe this will give you what you want. Performance would be dependent on appropriate indexing, etc.

SELECT COUNT(DISTINCT E1.CLIENT_ID)
FROM EVENTS E1
INNER JOIN EVENTS E2 ON 
    E1.CLIENT_ID = E2.CLIENT_ID AND
    E1.PRODUCT_ID = E2.PRODUCT_ID AND
    E1.EVENT_TS < E2.EVENT_TS AND 
    E2.EVENT_TYPE = '3'
WHERE E1.EVENT_TYPE = '1'