I want to calculate a new column that assigns an ID based on change in sequence per group and within that group.
Consider the following table:
# | user_id | my_date | color |
# |---------|------------|--------|
# | a | 2023-02-01 | red | ----
# | a | 2023-03-22 | red | |
# | a | 2023-03-30 | red | | this is period *1* for user_id = a
# | a | 2023-06-10 | red | |
# | a | 2023-06-11 | red | ----
# | a | 2023-07-03 | green |
# | a | 2023-07-09 | green |
# | a | 2024-01-11 | green |
# | a | 2024-02-11 | yellow |
# | a | 2024-02-12 | yellow |
# | a | 2024-02-13 | yellow |
# | a | 2024-02-14 | yellow |
# | b | 2022-10-20 | blue |
# | b | 2022-10-21 | blue |
# | b | 2022-10-22 | blue |
# | b | 2022-10-23 | brown | ----
# | b | 2022-10-24 | brown | | this is period *2* for user_id = b
# | b | 2022-10-25 | brown | ----
# | b | 2022-10-26 | blue |
# | b | 2022-10-27 | blue |
For each user_id, if we follow along my_date by order (ascending), we can see that there are "periods" or sequences that are characterized by the same color.
I want to create a new column that accounts for the period/sequence.
Expected output
# | user_id | my_date | color | period_number |
# |---------|------------|--------|---------------|
# | a | 2023-02-01 | red | 1 |
# | a | 2023-03-22 | red | 1 |
# | a | 2023-03-30 | red | 1 |
# | a | 2023-06-10 | red | 1 |
# | a | 2023-06-11 | red | 1 |
# | a | 2023-07-03 | green | 2 |
# | a | 2023-07-09 | green | 2 |
# | a | 2024-01-11 | green | 2 |
# | a | 2024-02-11 | yellow | 3 |
# | a | 2024-02-12 | yellow | 3 |
# | a | 2024-02-13 | yellow | 3 |
# | a | 2024-02-14 | yellow | 3 |
# | b | 2022-10-20 | blue | 1 |
# | b | 2022-10-21 | blue | 1 |
# | b | 2022-10-22 | blue | 1 |
# | b | 2022-10-23 | brown | 2 |
# | b | 2022-10-24 | brown | 2 |
# | b | 2022-10-25 | brown | 2 |
# | b | 2022-10-26 | blue | 3 |
# | b | 2022-10-27 | blue | 3 |
SQL Dialect
I use AWS Athena, based on Trino SQL.
Reproducible data
WITH my_table AS (
SELECT *
FROM (VALUES
('a', DATE '2023-02-01', 'red'),
('a', DATE '2023-03-22', 'red'),
('a', DATE '2023-03-30', 'red'),
('a', DATE '2023-06-10', 'red'),
('a', DATE '2023-06-11', 'red'),
('a', DATE '2023-07-03', 'green'),
('a', DATE '2023-07-09', 'green'),
('a', DATE '2024-01-11', 'green'),
('a', DATE '2024-02-11', 'yellow'),
('a', DATE '2024-02-12', 'yellow'),
('a', DATE '2024-02-13', 'yellow'),
('a', DATE '2024-02-14', 'yellow'),
('b', DATE '2022-10-20', 'blue'),
('b', DATE '2022-10-21', 'blue'),
('b', DATE '2022-10-22', 'blue'),
('b', DATE '2022-10-23', 'brown'),
('b', DATE '2022-10-24', 'brown'),
('b', DATE '2022-10-25', 'brown'),
('b', DATE '2022-10-26', 'blue'),
('b', DATE '2022-10-27', 'blue')
) AS t(user_id, my_date, color)
)
SELECT *
FROM my_table;
UPDATE – One possible direction
From this answer I learned that I could mark the rows in which the change in color happens, per user_id:
breaking_points_marked AS (
SELECT *,
lag(color) OVER (PARTITION BY user_id ORDER BY my_date) IS DISTINCT FROM color AS is_starting_a_new_period
FROM my_table
)
SELECT *
FROM breaking_points_marked
ORDER BY user_id,
my_date;
-- user_id|my_date |color |is_starting_a_new_period|
-- -------+----------+------+------------------------+
-- a |2023-02-01|red |true |
-- a |2023-03-22|red |false |
-- a |2023-03-30|red |false |
-- a |2023-06-10|red |false |
-- a |2023-06-11|red |false |
-- a |2023-07-03|green |true |
-- a |2023-07-09|green |false |
-- a |2024-01-11|green |false |
-- a |2024-02-11|yellow|true |
-- a |2024-02-12|yellow|false |
-- a |2024-02-13|yellow|false |
-- a |2024-02-14|yellow|false |
-- b |2022-10-20|blue |true |
-- b |2022-10-21|blue |false |
-- b |2022-10-22|blue |false |
-- b |2022-10-23|brown |true |
-- b |2022-10-24|brown |false |
-- b |2022-10-25|brown |false |
-- b |2022-10-26|blue |true |
-- b |2022-10-27|blue |false |
Based on is_starting_a_new_period, the solution seems closer. I need to find a way to add another column that starts at 1 per user_id, and increments at each true (while it carries over the value along the falses).
I thought I had found the solution in this comment, but unfortunately I don't know how to adjust it to my situation.
SELECT *,
1 + COUNT(CASE WHEN is_starting_a_new_period = TRUE THEN 1 END) OVER (PARTITION BY user_id
ORDER BY my_date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS period_number
FROM breaking_points_marked
ORDER BY user_id,
my_date;
-- not the desired output :/
-- user_id|my_date |color |is_starting_a_new_period|period_number|
-- -------+----------+------+------------------------+-------------+
-- a |2023-02-01|red |true | 1|
-- a |2023-03-22|red |false | 2|
-- a |2023-03-30|red |false | 2|
-- a |2023-06-10|red |false | 2|
-- a |2023-06-11|red |false | 2|
-- a |2023-07-03|green |true | 2|
-- a |2023-07-09|green |false | 3|
-- a |2024-01-11|green |false | 3|
-- a |2024-02-11|yellow|true | 3|
-- a |2024-02-12|yellow|false | 4|
-- a |2024-02-13|yellow|false | 4|
-- a |2024-02-14|yellow|false | 4|
-- b |2022-10-20|blue |true | 1|
-- b |2022-10-21|blue |false | 2|
-- b |2022-10-22|blue |false | 2|
-- b |2022-10-23|brown |true | 2|
-- b |2022-10-24|brown |false | 3|
-- b |2022-10-25|brown |false | 3|
-- b |2022-10-26|blue |true | 3|
-- b |2022-10-27|blue |false | 4|
See example
query
Update1.
If possible in your DBMS to use windows functions, the task becomes one of the
gap and islandstype tasks.See example
Update2.
There compared current color and previous color (within a sequence of rows by user_Id and date).
If color is changed - so, that is start of new period.
For first row for (user_Id, date) lag(color) returns null - previous row absent.
We use lag(color,1,color) - return previous color or return current color if no previous row.
So, first row of group has
new_period=1.Further,
count of
start new periodis a number of period.+1for numbering from 1, else period_number will0,1,2....Output