In Trino/Presto SQL: Create a new column that accounts (enumerates) for change in sequence per group

40 views Asked by At

I want to calculate a new column that assigns an ID based on change in sequence per group and within that group.

Consider the following table:

# | user_id | my_date    | color  |
# |---------|------------|--------|
# | a       | 2023-02-01 | red    | ----
# | a       | 2023-03-22 | red    |    |
# | a       | 2023-03-30 | red    |    | this is period *1* for user_id = a
# | a       | 2023-06-10 | red    |    |
# | a       | 2023-06-11 | red    | ----
# | a       | 2023-07-03 | green  |
# | a       | 2023-07-09 | green  |
# | a       | 2024-01-11 | green  |
# | a       | 2024-02-11 | yellow |
# | a       | 2024-02-12 | yellow |
# | a       | 2024-02-13 | yellow |
# | a       | 2024-02-14 | yellow |
# | b       | 2022-10-20 | blue   |
# | b       | 2022-10-21 | blue   |
# | b       | 2022-10-22 | blue   |
# | b       | 2022-10-23 | brown  | ----
# | b       | 2022-10-24 | brown  |    | this is period *2* for user_id = b
# | b       | 2022-10-25 | brown  | ----
# | b       | 2022-10-26 | blue   |
# | b       | 2022-10-27 | blue   |

For each user_id, if we follow along my_date by order (ascending), we can see that there are "periods" or sequences that are characterized by the same color.

I want to create a new column that accounts for the period/sequence.

Expected output

# | user_id | my_date    | color  | period_number |
# |---------|------------|--------|---------------|
# | a       | 2023-02-01 | red    | 1             |
# | a       | 2023-03-22 | red    | 1             |
# | a       | 2023-03-30 | red    | 1             |
# | a       | 2023-06-10 | red    | 1             |
# | a       | 2023-06-11 | red    | 1             |
# | a       | 2023-07-03 | green  | 2             |
# | a       | 2023-07-09 | green  | 2             |
# | a       | 2024-01-11 | green  | 2             |
# | a       | 2024-02-11 | yellow | 3             |
# | a       | 2024-02-12 | yellow | 3             |
# | a       | 2024-02-13 | yellow | 3             |
# | a       | 2024-02-14 | yellow | 3             |
# | b       | 2022-10-20 | blue   | 1             |
# | b       | 2022-10-21 | blue   | 1             |
# | b       | 2022-10-22 | blue   | 1             |
# | b       | 2022-10-23 | brown  | 2             |
# | b       | 2022-10-24 | brown  | 2             |
# | b       | 2022-10-25 | brown  | 2             |
# | b       | 2022-10-26 | blue   | 3             |
# | b       | 2022-10-27 | blue   | 3             |

SQL Dialect

I use AWS Athena, based on Trino SQL.

Reproducible data

WITH my_table AS (
    SELECT *
    FROM (VALUES
        ('a', DATE '2023-02-01', 'red'),
        ('a', DATE '2023-03-22', 'red'),
        ('a', DATE '2023-03-30', 'red'),
        ('a', DATE '2023-06-10', 'red'),
        ('a', DATE '2023-06-11', 'red'),
        ('a', DATE '2023-07-03', 'green'),
        ('a', DATE '2023-07-09', 'green'),
        ('a', DATE '2024-01-11', 'green'),
        ('a', DATE '2024-02-11', 'yellow'),
        ('a', DATE '2024-02-12', 'yellow'),
        ('a', DATE '2024-02-13', 'yellow'),
        ('a', DATE '2024-02-14', 'yellow'),
        ('b', DATE '2022-10-20', 'blue'),
        ('b', DATE '2022-10-21', 'blue'),
        ('b', DATE '2022-10-22', 'blue'),
        ('b', DATE '2022-10-23', 'brown'),
        ('b', DATE '2022-10-24', 'brown'),
        ('b', DATE '2022-10-25', 'brown'),
        ('b', DATE '2022-10-26', 'blue'),
        ('b', DATE '2022-10-27', 'blue')
    ) AS t(user_id, my_date, color)
)

SELECT *
FROM my_table; 

UPDATE – One possible direction

From this answer I learned that I could mark the rows in which the change in color happens, per user_id:

breaking_points_marked AS (
SELECT *, 
       lag(color) OVER (PARTITION BY user_id ORDER BY my_date) IS DISTINCT FROM color AS is_starting_a_new_period
FROM   my_table
)

SELECT *
FROM breaking_points_marked 
ORDER BY user_id, 
         my_date; 

-- user_id|my_date   |color |is_starting_a_new_period|
-- -------+----------+------+------------------------+
-- a      |2023-02-01|red   |true                    |
-- a      |2023-03-22|red   |false                   |
-- a      |2023-03-30|red   |false                   |
-- a      |2023-06-10|red   |false                   |
-- a      |2023-06-11|red   |false                   |
-- a      |2023-07-03|green |true                    |
-- a      |2023-07-09|green |false                   |
-- a      |2024-01-11|green |false                   |
-- a      |2024-02-11|yellow|true                    |
-- a      |2024-02-12|yellow|false                   |
-- a      |2024-02-13|yellow|false                   |
-- a      |2024-02-14|yellow|false                   |
-- b      |2022-10-20|blue  |true                    |
-- b      |2022-10-21|blue  |false                   |
-- b      |2022-10-22|blue  |false                   |
-- b      |2022-10-23|brown |true                    |
-- b      |2022-10-24|brown |false                   |
-- b      |2022-10-25|brown |false                   |
-- b      |2022-10-26|blue  |true                    |
-- b      |2022-10-27|blue  |false                   |

Based on is_starting_a_new_period, the solution seems closer. I need to find a way to add another column that starts at 1 per user_id, and increments at each true (while it carries over the value along the falses).

I thought I had found the solution in this comment, but unfortunately I don't know how to adjust it to my situation.

SELECT *,
       1 + COUNT(CASE WHEN is_starting_a_new_period = TRUE THEN 1 END) OVER (PARTITION BY user_id 
                                                                             ORDER BY my_date 
                                                                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS period_number
FROM breaking_points_marked 
ORDER BY user_id, 
         my_date;

-- not the desired output :/

-- user_id|my_date   |color |is_starting_a_new_period|period_number|
-- -------+----------+------+------------------------+-------------+
-- a      |2023-02-01|red   |true                    |            1|
-- a      |2023-03-22|red   |false                   |            2|
-- a      |2023-03-30|red   |false                   |            2|
-- a      |2023-06-10|red   |false                   |            2|
-- a      |2023-06-11|red   |false                   |            2|
-- a      |2023-07-03|green |true                    |            2|
-- a      |2023-07-09|green |false                   |            3|
-- a      |2024-01-11|green |false                   |            3|
-- a      |2024-02-11|yellow|true                    |            3|
-- a      |2024-02-12|yellow|false                   |            4|
-- a      |2024-02-13|yellow|false                   |            4|
-- a      |2024-02-14|yellow|false                   |            4|
-- b      |2022-10-20|blue  |true                    |            1|
-- b      |2022-10-21|blue  |false                   |            2|
-- b      |2022-10-22|blue  |false                   |            2|
-- b      |2022-10-23|brown |true                    |            2|
-- b      |2022-10-24|brown |false                   |            3|
-- b      |2022-10-25|brown |false                   |            3|
-- b      |2022-10-26|blue  |true                    |            3|
-- b      |2022-10-27|blue  |false                   |            4|
1

There are 1 answers

0
ValNik On BEST ANSWER

See example

with my_table AS (
    SELECT *
    FROM (VALUES
        ('a',  '2023-02-01', 'red'),
        ('a',  '2023-03-30', 'red'),
        ('a',  '2023-06-10', 'red'),
        ('a',  '2023-06-11', 'red'),
        ('a',  '2023-07-03', 'green'),
        ('a',  '2023-07-09', 'green'),
        ('a',  '2024-01-11', 'green'),
        ('a',  '2024-02-11', 'yellow'),
        ('a',  '2024-02-12', 'yellow'),
        ('a',  '2024-02-13', 'yellow'),
        ('a',  '2024-02-14', 'yellow'),
        ('b',  '2022-10-20', 'blue'),
        ('b',  '2022-10-21', 'blue'),
        ('b',  '2022-10-22', 'blue'),
        ('b',  '2022-10-23', 'brown'),
        ('b',  '2022-10-24', 'brown'),
        ('b',  '2022-10-25', 'brown'),
        ('b',  '2022-10-26', 'blue'),
        ('b',  '2022-10-27', 'blue')
    ) AS t(user_id, my_date, color)
)

query

,periods as(select user_id,color,my_date
  ,(select min(my_date) from my_table t2 
    where t2.user_id=t.user_id and t2.my_date>t.my_date 
     and t2.color<>t.color) next_date
from my_table t
)
,gr as(
select user_id,color,next_date
  ,min(my_date)mindt 
from periods
group by user_id,color,next_date
)
,res as(
select t.* 
from my_table t
left join gr g on g.user_id=t.user_id
  and g.mindt<=t.my_date
)
select user_id,color,my_date,count(*) period_number
from res
group by user_id,color,my_date
order by user_id,my_date;

Update1.

If possible in your DBMS to use windows functions, the task becomes one of the gap and islands type tasks.
See example

with cte1 as(
select *
  ,case when color=lag(color,1,color)over(partition by user_id order by my_date)
     then 0
   else 1
   end new_period
from my_table
)
select * 
  ,sum(new_period)over(partition by user_id order by my_date)+1 period_number
from cte1

Update2.
There compared current color and previous color (within a sequence of rows by user_Id and date).

color=lag(color,1,color)over(partition by user_id order by my_date)

If color is changed - so, that is start of new period.
For first row for (user_Id, date) lag(color) returns null - previous row absent.
We use lag(color,1,color) - return previous color or return current color if no previous row.
So, first row of group has new_period=1.

Further,

 sum(new_period)over(partition by user_id order by my_date)+1 period_number

count of start new period is a number of period. +1 for numbering from 1, else period_number will 0,1,2....

Output

user_id my_date color new_period period_number
a 2023-02-01 red 0 1
a 2023-03-30 red 0 1
a 2023-06-10 red 0 1
a 2023-06-11 red 0 1
a 2023-07-03 green 1 2
a 2023-07-09 green 0 2
a 2024-01-11 green 0 2
a 2024-02-11 yellow 1 3
a 2024-02-12 yellow 0 3
a 2024-02-13 yellow 0 3
a 2024-02-14 yellow 0 3
b 2022-10-20 blue 0 1
b 2022-10-21 blue 0 1
b 2022-10-22 blue 0 1
b 2022-10-23 brown 1 2
b 2022-10-24 brown 0 2
b 2022-10-25 brown 0 2
b 2022-10-26 blue 1 3
b 2022-10-27 blue 0 3