Snowflake Analytical Query Design

I have a tricky query design requirement. I have tried different types and combinations of analytical functions to achieve my result from the data set below. My other plan is to write a stored procedure, but I want to reach out to the expert group before I change direction.

Input Data Set:

(image: input data set)

Required Output Data Set with the Group Column: When the session id changes and the same session id later comes back, it has to get a different group. I tried a LEAD/LAG combination but could not get the desired output below; one scenario or another kept breaking.

(image: required output data set with the group column)

Thanks!

There are 3 answers

2
Lukasz Szozda On BEST ANSWER

The SQL language is expressive enough to allow a declarative solution for complex requirements.

Snowflake has recently implemented the SQL:2016 standard clause MATCH_RECOGNIZE, which was designed to solve such cases in a very straightforward way.

Identifying Sequences of Rows That Match a Pattern

In some cases, you might need to identify sequences of table rows that match a pattern. For example, you might need to:

  • Determine which users followed a specific sequence of pages and actions on your website before opening a support ticket or making a purchase.

  • Find the stocks with prices that followed a V-shaped or W-shaped recovery over a period of time.

  • Look for patterns in sensor data that might indicate an upcoming system failure.

Data preparation:

CREATE OR REPLACE TABLE t
AS
SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:30:00'::DATE AS Trans_dt, 1 AS VERSION_ID
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:35:00'::DATE AS Trans_dt, 2
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:37:00'::DATE AS Trans_dt, 3
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:38:00'::DATE AS Trans_dt, 4
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:39:00'::DATE AS Trans_dt, 5
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:40:00'::DATE AS Trans_dt, 6;

Query:

SELECT *
FROM t
MATCH_RECOGNIZE (
    PARTITION BY POL_ID
    ORDER BY VERSION_ID
    MEASURES MATCH_NUMBER() AS group_id
            --,CLASSIFIER() as cks
    ALL ROWS PER MATCH
    PATTERN (a+b*)
    DEFINE a as sess_id = FIRST_VALUE(sess_id)
          ,b AS sess_id != FIRST_VALUE(sess_id)
) mr
ORDER BY POL_ID, VERSION_ID;

db<>fiddle demo(Oracle)

Output:

SESS_ID POL_ID  TRANS_DT    VERSION_ID  GROUP_ID
101 1   2021-04-17  1   1
101 1   2021-04-17  2   1
102 1   2021-04-17  3   1
102 1   2021-04-17  4   1
101 1   2021-04-17  5   2
101 1   2021-04-17  6   2

How it works:

  1. Define a pattern: (a+b*), which is a Perl-style regexp: a (one or more) followed by b (zero or more)
  2. Define the pattern components: a (sess_id is the same as the first element of the group) and b (sess_id is not the same as the first element of the group)
  3. Define the measure MATCH_NUMBER(): "Returns the sequential number of the match"
  4. Perform this operation for each POL_ID, using VERSION_ID as the sorting column
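
To see why the pattern yields those group numbers, here is a minimal sketch in plain Python (not Snowflake SQL) that mimics the matching for a single POL_ID partition. It assumes greedy quantifiers, that each new match starts right after the previous one ends, and that FIRST_VALUE refers to the first row of the current match, as step 2 describes. The function name match_numbers is made up for illustration.

```python
def match_numbers(sess_ids):
    """Mimic PATTERN (a+b*) over one partition, already sorted by VERSION_ID.

    Returns the match number (group_id) assigned to each row.
    """
    groups = []
    match_no = 0
    i = 0
    while i < len(sess_ids):
        match_no += 1
        first = sess_ids[i]  # first sess_id of the current match
        # a+ : rows whose sess_id equals the first one (row i itself qualifies)
        while i < len(sess_ids) and sess_ids[i] == first:
            groups.append(match_no)
            i += 1
        # b* : rows whose sess_id differs from the first one
        while i < len(sess_ids) and sess_ids[i] != first:
            groups.append(match_no)
            i += 1
    return groups


print(match_numbers([101, 101, 102, 102, 101, 101]))  # [1, 1, 1, 1, 2, 2]
```

The second match begins at the row where sess_id 101 comes back, which is exactly where GROUP_ID switches from 1 to 2 in the output above.
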
0
Gordon Linoff On

Basically, you want to use lag() to see when the session id changes. Then you want a cumulative sum, but only within each session id:

select t.*,
       sum(case when prev_session_id = session_id then 0 else 1 end) over (
           partition by pol_id, session_id
           order by trans_dt
          ) as grouping
from (select t.*,
             lag(session_id) over (partition by pol_id order by trans_dt) as prev_session_id
      from t
     ) t;

This is a tricky variant of a gaps-and-islands problem. A more normal situation is for the three pairs of rows to be enumerated 1, 2, and 3. For that, you would just remove session_id from the partition by in the sum().
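
As a rough check of the logic, the same idea can be sketched in plain Python (not SQL), assuming the rows of one pol_id are already sorted by trans_dt. The function name and the per_session flag are hypothetical; the flag switches between the two partitioning choices described above.

```python
from collections import defaultdict


def lag_then_running_sum(sess_ids, per_session=True):
    """Flag rows where the session id changes, then keep a running sum.

    per_session=True  sums the flags separately for each session id.
    per_session=False sums them over all rows (plain island numbering).
    """
    running = defaultdict(int)  # one running sum per partition key
    prev = None                 # plays the role of lag(session_id)
    out = []
    for sid in sess_ids:
        edge = 0 if prev == sid else 1  # first row has no predecessor, so 1
        key = sid if per_session else None
        running[key] += edge
        out.append(running[key])
        prev = sid
    return out


rows = [101, 101, 102, 102, 101, 101]
print(lag_then_running_sum(rows))                     # [1, 1, 1, 1, 2, 2]
print(lag_then_running_sum(rows, per_session=False))  # [1, 1, 2, 2, 3, 3]
```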

0
Simeon Pilgrim On

It's not obvious from the question how you want group_id to relate to pol_id, so I have ignored pol_id below.

I'm using a CTE just for the fake data:

WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6)
    v(sess_id, pol_id, trans_dt, version_id)
)

I would then want to write these operations:

SELECT *
    ,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
    ,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
    ,r1- r2 as r3
    ,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
    ,IFF(lag_r3 != r3, 1, 0) as sess_edge
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM data

Here r1 and r2 find where there is a gap in sess_id with respect to trans_dt: their difference r3 stays constant within an unbroken run of one session. Comparing r3 with lag_r3 (the previous r3 for the same sess_id) then marks the rows where a session resumes after a gap. Those are the edges you want to count, hence the SUM; it is zero based, so add 1 to get the value you want.
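
The same steps can be traced in a few lines of plain Python (not SQL), assuming the rows are already sorted by trans_dt. This is only a sketch to follow the arithmetic; the function name is made up, and the variable names mirror the column aliases above.

```python
from collections import defaultdict


def row_number_diff_groups(sess_ids):
    """Trace r1, r2, r3, lag_r3 and sess_edge, returning GROUP_ID per row."""
    rows_seen = defaultdict(int)  # r2: row number within each sess_id
    last_r3 = {}                  # previous r3 per sess_id (lag_r3)
    edges_so_far = 0              # running SUM(sess_edge)
    out = []
    for r1, sid in enumerate(sess_ids, start=1):
        rows_seen[sid] += 1
        r2 = rows_seen[sid]
        r3 = r1 - r2
        lag_r3 = last_r3.get(sid)  # None on the first row of a session
        # A NULL comparison is not true in SQL, so a missing lag gives 0
        sess_edge = 1 if lag_r3 is not None and lag_r3 != r3 else 0
        edges_so_far += sess_edge
        out.append(edges_so_far + 1)
        last_r3[sid] = r3
    return out


print(row_number_diff_groups([101, 101, 102, 102, 101, 101]))  # [1, 1, 1, 1, 2, 2]
```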

The single SELECT above is not valid in Snowflake, so it needs to be layered to work:

SELECT
    *
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (  
    SELECT
        *
        ,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
        ,IFF(lag_r3 != r3, 1, 0) as sess_edge
    FROM (
        SELECT *
            ,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
            ,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
            ,r1- r2 as r3
        FROM data
    )
)
ORDER BY trans_dt;

which gives:

SESS_ID POL_ID  TRANS_DT            VERSION_ID  R1  R2  R3  LAG_R3  SESS_EDGE   GROUP_ID
101     1       2021-04-17 09:30:00 1           1   1   0   null    0           1
101     1       2021-04-17 09:35:00 2           2   2   0   0       0           1
102     1       2021-04-17 09:37:00 3           3   1   2   null    0           1
102     1       2021-04-17 09:38:00 4           4   2   2   2       0           1
101     1       2021-04-17 09:39:00 5           5   3   2   0       1           2
101     1       2021-04-17 09:40:00 6           6   4   2   2       0           2   

So it can be seen how it is working. This can then be compressed down to:

SELECT
    sess_id
    ,pol_id
    ,trans_dt
    ,version_id
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (  
    SELECT
        *
        ,IFF(LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) != r3, 1, 0) as sess_edge
    FROM (
        SELECT *
            ,ROW_NUMBER() OVER (ORDER BY trans_dt)- ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) as r3
        FROM data
    )
)
ORDER BY trans_dt;

This is way more complex than Gordon's answer, which, rewritten into the same form as mine, is:

select *
    ,sum(edge) over ( partition by pol_id, sess_id order by trans_dt ) as grouping
from (
    select *
        ,lag(sess_id) over (partition by pol_id order by trans_dt) as prev_session_id
        ,iff(prev_session_id = sess_id, 0, 1) AS edge
    from data
)
ORDER BY 2,3;

which is rather clever, as it is SUMing the edges per sess_id.

But if you add extra data:

WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6),
      (102, 1, '2021-04-17 09:41:00', 7),
      (102, 1, '2021-04-17 09:42:00', 8),
      (103, 1, '2021-04-17 09:43:00', 9),
      (103, 1, '2021-04-17 09:44:00', 10)
    v(sess_id, pol_id, trans_dt, VERSION_ID)
)

Gordon's answer will assign the last session (103) the group 1, Lukasz's will assign it 2, and mine will assign it 3, because mine counts a second edge when 102 comes back. Which one is right depends on your intended outcome.
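
To compare the three approaches on the extended data without a Snowflake session, here is a rough plain-Python sketch (not SQL). It assumes a single pol_id, rows sorted by trans_dt, and greedy matching for the a+b* pattern; all function names are made up for illustration. It prints the group each approach gives to every row.

```python
from collections import defaultdict

# sess_id column of the extended data, ordered by trans_dt
SESS = [101, 101, 102, 102, 101, 101, 102, 102, 103, 103]


def lag_sum_per_session(sess):
    # lag() change flags, summed separately for every sess_id
    total, prev, out = defaultdict(int), None, []
    for sid in sess:
        total[sid] += 0 if sid == prev else 1
        out.append(total[sid])
        prev = sid
    return out


def row_number_difference(sess):
    # r3 = r1 - r2; an edge is a change of r3 within one sess_id
    count, last_r3, edges, out = defaultdict(int), {}, 0, []
    for r1, sid in enumerate(sess, start=1):
        count[sid] += 1
        r3 = r1 - count[sid]
        if sid in last_r3 and last_r3[sid] != r3:
            edges += 1
        last_r3[sid] = r3
        out.append(edges + 1)
    return out


def pattern_a_plus_b_star(sess):
    # a+ = rows equal to the match's first sess_id, b* = rows that differ
    out, match_no, i = [], 0, 0
    while i < len(sess):
        match_no += 1
        first, j = sess[i], i
        while j < len(sess) and sess[j] == first:
            j += 1
        while j < len(sess) and sess[j] != first:
            j += 1
        out += [match_no] * (j - i)
        i = j
    return out


for fn in (lag_sum_per_session, row_number_difference, pattern_a_plus_b_star):
    print(f"{fn.__name__:24}", fn(SESS))
```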

Also, what do you want to happen when pol_id changes: do you want the group to be a global count, or would the second pol_id start at 1 again?