How to find combination of intersection from many tables?

258 views Asked by At

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count daily active user that comes from the combination of these channels. Each channel has its own table and track its respective users.

The tables looks like the following,

channel A
date         user_id
2020-08-01   A
2020-08-01   B
2020-08-01   C

channel B
date         user_id
2020-08-01   C
2020-08-01   D
2020-08-01   G

channel C
date         user_id
2020-08-01   A
2020-08-01   C
2020-08-01   F

I want to know the following combinations

  1. Only visit channel A
  2. Only visit channel A & B
  3. Only visit channel B & C
  4. Only visit channel B
  5. etc.

However, when there are a lot of channels (I have around 8 channels) the combination is a lot. What I've done roughly is as simple as this (this one includes channel A)

SELECT 
    a.date, 
    COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
    COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
    ...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date 
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1

but extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).

Any smart ideas to solve this? I was thinking to use FULL OUTER JOIN but can't seem to get the grasp out of it. Answers really appreciated.

4

There are 4 answers

1
Gordon Linoff On BEST ANSWER

I would approach this with union all and two levels of aggregation:

select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
      from ((select distinct date, user_id, 'a' as channel from a) union all
            (select distinct date, user_id, 'b' as channel from b) union all
            (select distinct date, user_id, 'c' as channel from c) 
           ) abc
      group by date, user_id
     ) c
group by date, channels;
  
1
GMB On

I am thinking full join and aggregation:

select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from      (select 'a' channel_a, a.* from channel_a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c
2
Mikhail Berlyant On

However, when there are a lot of channels (I have around 8 channels) the combination is a lot

extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).

Any smart ideas to solve this?

Below is for BigQuery Standard SQL and addresses exactly above aspect of the OP's concerns

#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>) 
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
  var combine = function(a) {
    var fn = function(n, src, got, all) {
      if (n == 0) {
        if (got.length > 0) {
          all[all.length] = got;
        } return;
      }
      for (var j = 0; j < src.length; j++) {
        fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
      } return;
    }
    var all = []; for (var i = 1; i < a.length; i++) {
      fn(i, a, [], all);
    }
    all.push(a);
    return all;
  } 
  return combine(a)
''';
with users as (
    select distinct date, user_id, 'A' channel from channel_A union all
    select distinct date, user_id, 'B' from channel_B union all
    select distinct date, user_id, 'C' from channel_C 
), visits as (
  select date, user_id, 
    string_agg(channel, ' & ' order by channel) combination
  from users
  group by date, user_id
), channels AS (
  select channel, cast(row_number() over(order by channel) as string) channel_num
  from (select distinct channel from users)
), combinations as (
  select string_agg(channel, ' & ' order by channel_num) combination
  from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items, 
    unnest(split(items)) AS channel_num
  join channels using(channel_num)
  group by items
)
select date, 
  combination as channels_visited_only, 
  count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination

If to apply to sample data from your question - output is

enter image description here

Some explanations to help with using above

  • CTE users just simply union all tables and adds channel column to be able to distinguish from which table respective row came

  • CTE visits extracts list of all visited channels for each user-date combination

  • CTE channels just simply prepares list of channels and assigns number for later use

  • CTE combinations uses JS UDF to generate all combinations of channels' numbers and then joins them back to channels to generate channels combinations

  • and final SELECT statement is simply looks for those users whose list of visited channels match channels combination generated in previous step

Some recommendations for further streamlining above code

  • assuming your channel tables names follow channel_* pattern

you can use wildcard tables feature in users CTE and instead of

select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C 

you can use something like below - so just one line instead of as many lines as cannles you have

select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*      
1
Martin Weitzmann On

I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

E.g.

  1. is (A except B) except C
  2. is A intersect B

etc.