MS SQL Server: transforming SCD-2 to SCD-2 with less variables

399 views Asked by At

I retrieve data from SCD-2 table with many parameters and I need to build my own SCD-2 with only one of them. Therefore, I need to get rid of excessive intervals. Please recommend an algorithm to perform that in the best way.

What I receive from the source table:

what I receive from the source table

I need to transform it to:

I need to transform it to:

2

There are 2 answers

3
whynot On BEST ANSWER

You can use following steps to get the required result. Of course you can do it all in one step with sub-selects or CTEs, but for better traceability I prefere temporary tables.

DROP TABLE IF EXISTS #source;
CREATE TABLE #source (key1 integer, value1 integer, row_actual_from date, row_actual_to date);
 
INSERT INTO #source
VALUES
(19999923, 15,   '2020-01-01', '2020-01-02'),
(19999923, 15,   '2020-01-03', '2020-01-05'),
(19999923, 15,   '2020-01-06', '2020-01-08'),
(19999923, 11,   '2020-01-09', '2020-01-12'),
(19999923, 3434, '2020-01-13', '2020-01-15'),
(19999923, 11,   '2020-01-16', '2020-01-20'),
(19999923, 15,   '2020-01-21', '2020-02-02'),
(19999923, 3434, '2020-02-03', '2020-02-10'),
(19999923, 3434, '2020-02-11', '2020-02-19'),
(19999923, 3434, '2020-02-20', '2020-02-25'),
(19999923, 99,   '2020-02-26', '9999-12-31');

Step 1: Identify start and end for a single value period.

Note that in LAG/LEAD it is essentally to have a value as NULL replacement (-99 in example) which dosn't match with the possible values in the column.

    DROP TABLE IF EXISTS #step1;
    SELECT
        key1, value1, row_actual_from, row_actual_to
        , period_start = CASE WHEN LAG(value1,  1, -99) OVER (PARTITION BY key1 ORDER BY row_actual_from) <> value1 THEN 1 ELSE 0 END
        , period_end   = CASE WHEN LEAD(value1, 1, -99) OVER (PARTITION BY key1 ORDER BY row_actual_from) <> value1 THEN 1 ELSE 0 END
    INTO #step1
    FROM #source
    ORDER BY key1, row_actual_from;

Step 2: Filter on start/end rows and assign row_actual_to of end to start.

If the period of a value has only one row this row has period_start and period_end set to 1 and therefore the sum is 2. In this case the content of row_acutal_to has already the wanted value.

    DROP TABLE IF EXISTS #step2;
    SELECT
        key1, value1, row_actual_from, row_actual_to, period_start, period_end
      , valid_from = row_actual_from
      , valid_to   = CASE (period_start + period_end)
                     WHEN 1 THEN LEAD(row_actual_to, 1) OVER (PARTITION BY key1, value1 ORDER BY row_actual_from)
                     WHEN 2 THEN row_actual_to ELSE NULL END
    INTO #step2
    FROM #step1
    WHERE (period_start + period_end) > 0
    ORDER BY key1, row_actual_from;

Step 3: Filter (adjusted) start row of value period.

    SELECT key1, value1, valid_from, valid_to
    FROM   #step2
    WHERE  period_start = 1
    ORDER BY key1, row_actual_from;

0
NickW On

This is obviously complicated by the fact that the same "value" can repeat in multiple groups - so you can't just use straightforward MIN/MAX functions. You could probably code this in a javascript stored proc but I thought I'd try and find a solution in (almost) pure SQL.

The challenge is to try and create a "group" every time the value changes - as then you can do a simple MIN/MAX on the dates within the group. The way I have (hopefully!) solved this is as follows:

  1. Create a CTE where a calculated field is set to the next value in a sequence whenever the value of the current row is different from the value in the previous row; where there is no difference then the field is set to null - this is important due to the way the LAG function in the next CTE handles NULLs
  2. Create a 2nd CTE where a calculated grouping column is set to last non-null value in the calculated column created in the previous column - using the LAG function set to ignore nulls
  3. From the 2nd CTE, query the min and max date values grouping by the key, value and grouping columns

CODE

CREATE TABLE SRC_TABLE (key1 integer, value1 integer, row_actual_from date, row_actual_to date);

INSERT INTO SRC_TABLE
VALUES
(19999923, 15, '2020-01-01', '2020-01-02'),
(19999923, 15, '2020-01-03', '2020-01-05'),
(19999923, 15, '2020-01-06', '2020-01-08'),
(19999923, 3434, '2020-01-09', '2020-01-12'),
(19999923, 3434, '2020-01-13', '2020-01-15'),
(19999923, 15, '2020-01-16', '2020-01-20'),
(19999923, 15, '2020-01-21', '9999-12-31');


create or replace sequence seq_01 start = 1 increment = 1;
WITH T1 AS (
  SELECT KEY1, VALUE1, row_actual_from, row_actual_to
  ,CASE WHEN LAG(VALUE1,1,0) OVER (PARTITION BY KEY1 ORDER BY row_actual_from ASC) = VALUE1 THEN null ELSE seq_01.nextval END AS CHK_MIN
  from SRC_TABLE
  order by row_actual_from
),
T2 AS (
  SELECT KEY1, VALUE1, row_actual_from, row_actual_to, CHK_MIN
  ,CASE WHEN CHK_MIN IS NULL THEN LAG(CHK_MIN,1,0) IGNORE NULLS OVER (PARTITION BY KEY1 ORDER BY row_actual_from ASC) ELSE CHK_MIN END AS CHK_MIN_GRP
  FROM T1
)
SELECT KEY1, VALUE1, MIN(ROW_ACTUAL_FROM), MAX(ROW_ACTUAL_TO)
FROM T2
GROUP BY KEY1, VALUE1, CHK_MIN_GRP
;

RESULT

KEY1        VALUE1      MIN(ROW_ACTUAL_FROM)    MAX(ROW_ACTUAL_TO)
19999923        15      2020-01-01              2020-01-08
19999923        3434    2020-01-09              2020-01-15
19999923        15      2020-01-16              9999-12-31