How to do a non-equi join in SQL with multiple time ranges per group?

Question

How to do a non-equi join in SQL with multiple time ranges per group?

31 views Asked by Emman At 16 March 2024 at 12:55

I'm looking for a way to do a non-equi join in SQL, joining by whether col x from table A is within a range of dates given in col y from table B. However, table B has multiple possible ranges per ID, and the table's data format is long. For example:

# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1         | 2023-03-12     | banana                    |
# | 1         | 2023-03-25     | apple                     |
# | 1         | 2023-04-01     | orange                    |
# | 1         | 2023-04-05     | kiwi                      |
# | 1         | 2023-04-22     | grapefruit                |
# | 2         | 2024-12-15     | strawberry                |
# | 2         | 2024-01-11     | blueberry                 |
# | 2         | 2024-02-12     | mango                     |
# | 2         | 2024-02-29     | watermelon                |
# | 2         | 2024-03-10     | pear                      |


# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1         | 2023-03-15           | 1      | first period for user_id = 1 started
# | 1         | 2023-03-30           | 1      | on March 15 and ended on March 30.
# | 1         | 2023-04-02           | 2      |
# | 1         | 2023-04-10           | 2      |
# | 1         | 2023-04-12           | 3      |
# | 1         | 2023-04-20           | 3      |
# | 2         | 2024-01-01           | 1      |
# | 2         | 2024-01-05           | 1      |
# | 2         | 2024-02-10           | 2      | second period for user_id = 2 started
# | 2         | 2024-02-13           | 2      | on Feb 10 and ended on Feb 13.

Note about table B : each person_id could have one or many periods, and we cannot know – i.e., we're agnostic to – how many periods per person, when writing the SQL query.

Expected Output

# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1         | 2023-03-25     | apple                     | 1      |
# | 1         | 2023-04-05     | kiwi                      | 2      |
# | 2         | 2024-02-12     | mango                     | 2      |

SQL Dialect

I work on AWS Athena that is based on Trino SQL.

Reproducible data

WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-01'), 'orange'),
    (1, DATE('2023-04-05'), 'kiwi'),
    (1, DATE('2023-04-22'), 'grapefruit'),
    (2, DATE('2024-12-15'), 'strawberry'),
    (2, DATE('2024-01-11'), 'blueberry'),
    (2, DATE('2024-02-12'), 'mango'),
    (2, DATE('2024-02-29'), 'watermelon'),
    (2, DATE('2024-03-10'), 'pear')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
)

What I have tried so far

Well, not much. The regular non-equi join procedure I'm familiar with has the start date in one column and the end date in another column. But in the current case, we have multiple periods per person_id, and more problematic – we don't know how many. So even if I "pivot" the table to wide format, I still don't know how to figure out the multiple and unknown periods per person_id.

Original Q&A

There are 1 answers

**nbk** · Accepted Answer · 2024-03-16T13:24:11+00:00

You can pivot table_b to get the time ranges, that you need to join to table_a

WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-01'), 'orange'),
    (1, DATE('2023-04-05'), 'kiwi'),
    (1, DATE('2023-04-22'), 'grapefruit'),
    (2, DATE('2024-12-15'), 'strawberry'),
    (2, DATE('2024-01-11'), 'blueberry'),
    (2, DATE('2024-02-12'), 'mango'),
    (2, DATE('2024-02-29'), 'watermelon'),
    (2, DATE('2024-03-10'), 'pear')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
), table_b_pivot as (
SELECT person_id,MIN(period_start_and_end) from_date, MAX(period_start_and_end) as To_date, period FROM table_b
GROUP BY person_id, period)
SELECT table_a.person_id, breakfast_date, fruit_eaten_for_breakfast, period
FROM table_a JOIN table_b_pivot 
  ON table_a.breakfast_date BETWEEN table_b_pivot.from_date AND  table_b_pivot.To_date
AND table_a.person_id = table_b_pivot.person_id
ORDER BY table_a.person_id,period

person_id	breakfast_date	fruit_eaten_for_breakfast	period
1	2023-03-25	apple	1
1	2023-04-05	kiwi	2
2	2024-02-12	mango	2

TechQA.

How to do a non-equi join in SQL with multiple time ranges per group?

Expected Output

SQL Dialect

Reproducible data

What I have tried so far

There are 1 answers

Related Questions in SQL

Related Questions in JOIN

Related Questions in AMAZON-ATHENA

Related Questions in PRESTO

Related Questions in TRINO

Popular Questions

Trending Questions