I'm looking for a way to do a non-equi join in SQL, joining by whether col x from table A is within a range of dates given in col y from table B. However, table B has multiple possible ranges per ID, and the table's data format is long. For example:
# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1 | 2023-03-12 | banana |
# | 1 | 2023-03-25 | apple |
# | 1 | 2023-04-01 | orange |
# | 1 | 2023-04-05 | kiwi |
# | 1 | 2023-04-22 | grapefruit |
# | 2 | 2024-12-15 | strawberry |
# | 2 | 2024-01-11 | blueberry |
# | 2 | 2024-02-12 | mango |
# | 2 | 2024-02-29 | watermelon |
# | 2 | 2024-03-10 | pear |
# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1 | 2023-03-15 | 1 | first period for user_id = 1 started
# | 1 | 2023-03-30 | 1 | on March 15 and ended on March 30.
# | 1 | 2023-04-02 | 2 |
# | 1 | 2023-04-10 | 2 |
# | 1 | 2023-04-12 | 3 |
# | 1 | 2023-04-20 | 3 |
# | 2 | 2024-01-01 | 1 |
# | 2 | 2024-01-05 | 1 |
# | 2 | 2024-02-10 | 2 | second period for user_id = 2 started
# | 2 | 2024-02-13 | 2 | on Feb 10 and ended on Feb 13.
Note about table B : each person_id could have one or many periods, and we cannot know – i.e., we're agnostic to – how many periods per person, when writing the SQL query.
Expected Output
# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1 | 2023-03-25 | apple | 1 |
# | 1 | 2023-04-05 | kiwi | 2 |
# | 2 | 2024-02-12 | mango | 2 |
SQL Dialect
I work on AWS Athena that is based on Trino SQL.
Reproducible data
WITH
table_a AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-12'), 'banana'),
(1, DATE('2023-03-25'), 'apple'),
(1, DATE('2023-04-01'), 'orange'),
(1, DATE('2023-04-05'), 'kiwi'),
(1, DATE('2023-04-22'), 'grapefruit'),
(2, DATE('2024-12-15'), 'strawberry'),
(2, DATE('2024-01-11'), 'blueberry'),
(2, DATE('2024-02-12'), 'mango'),
(2, DATE('2024-02-29'), 'watermelon'),
(2, DATE('2024-03-10'), 'pear')
) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),
table_b AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-15'), 1),
(1, DATE('2023-03-30'), 1),
(1, DATE('2023-04-02'), 2),
(1, DATE('2023-04-10'), 2),
(1, DATE('2023-04-12'), 3),
(1, DATE('2023-04-20'), 3),
(2, DATE('2024-01-01'), 1),
(2, DATE('2024-01-05'), 1),
(2, DATE('2024-02-10'), 2),
(2, DATE('2024-02-13'), 2)
) AS t(person_id, period_start_and_end, period)
)
What I have tried so far
Well, not much. The regular non-equi join procedure I'm familiar with has the start date in one column and the end date in another column. But in the current case, we have multiple periods per person_id, and more problematic – we don't know how many. So even if I "pivot" the table to wide format, I still don't know how to figure out the multiple and unknown periods per person_id.
You can pivot table_b to get the time ranges, that you need to join to table_a