Creating SCD History in Redshift Table with SQL


Problem

I have a starting table that looks like this:

[StartingData — screenshot of the source table: identifier, loaddate, symbol]

I'm attempting to write a SQL query that will write this data into a similarly formatted table, but with additional adornments indicating when records expired and what records are active. The results would look like this:

identifier                       | loaddate   | loadenddate | activeflag | symbol
723a90699e99ec9e00216910910384bd | 2020-04-01 | 2020-04-07  | 0          | DXB
723a90699e99ec9e00216910910384bd | 2020-04-08 | 2999-12-31  | 1          | DXB CL

Note that there are 1000s of different identifiers, some with one, two, three+ different symbols over various different timeframes.

Requirements

  1. Any time an identifier is seen for the first time, it must be created in the final table with today's date as the loaddate, a loadenddate of 2999-12-31, and activeflag = 1
  2. When that identifier is seen on a later day, only add a row if the symbol has changed. If it has, "expire" the previous row by setting its loadenddate to the new row's loaddate minus 1 day and its activeflag to 0
  3. The SQL query (or queries) must be re-runnable against the source table each day going forward, correctly handling both existing data in the destination table and an empty destination table (initial run)
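
For context, requirements 1 and 2 can be sketched as an expire-then-insert pair of statements in Redshift. This is only a sketch: the table and column names are taken from the question, and it assumes startingtable holds exactly one snapshot row per identifier for the current day.

```sql
BEGIN;

-- Expire the active row whenever the incoming symbol differs
UPDATE finaltable
SET loadenddate = s.loaddate - 1,   -- Redshift: date minus integer days
    activeflag  = 0
FROM startingtable s
WHERE finaltable.listinghashkey = s.listinghashkey
  AND finaltable.activeflag = 1
  AND finaltable.symbol <> s.symbol;

-- Insert first-time identifiers and rows whose active record was just expired
INSERT INTO finaltable (listinghashkey, symbol, loaddate, loadenddate, activeflag)
SELECT s.listinghashkey, s.symbol, s.loaddate, '2999-12-31', 1
FROM startingtable s
LEFT JOIN finaltable f
  ON f.listinghashkey = s.listinghashkey
 AND f.activeflag = 1
WHERE f.listinghashkey IS NULL;   -- no active row exists for this identifier

COMMIT;
```

Because the UPDATE runs first, a changed symbol leaves its identifier with no active row, so the anti-join in the INSERT picks it up along with genuinely new identifiers; re-running the pair on unchanged data is a no-op.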

What I've Got So Far

To load initially (and not duplicate), I have the following SQL:

INSERT INTO finaltable (
    listinghashkey,
    symbol,
    loaddate,
    loadenddate,
    activeflag
)
SELECT
    s.listinghashkey,
    s.symbol,
    MAX(s.loaddate),
    '2999-12-31 00:00:00.0',
    1
FROM startingtable s
LEFT JOIN finaltable f ON s.listinghashkey = f.listinghashkey
WHERE f.listinghashkey IS NULL
GROUP BY s.listinghashkey, s.symbol;

1 Answer

Gordon Linoff (accepted answer)

Converting your initial format to the new format is pretty easy as a gaps-and-islands problem:

select identifier, symbol, min(loaddate) as loaddate,
       nullif(max(loaddate), max_loaddate) as loadenddate
from (select s.*,
             max(loaddate) over () as max_loaddate,
             row_number() over (partition by identifier order by loaddate) as seqnum,
             row_number() over (partition by identifier, symbol order by loaddate) as seqnum_2
      from startingtable s
     ) s
group by identifier, symbol, max_loaddate, (seqnum - seqnum_2);

This eschews the "is active" flag and the arbitrary future date. It just uses NULL for the infinite future. (You can easily tweak the logic for your version; this is just simpler to work with.)
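
To see why grouping on (seqnum - seqnum_2) isolates each run of a symbol, here is the trick applied to the question's sample identifier (intermediate dates filled in purely for illustration):

```
loaddate    symbol   seqnum  seqnum_2  seqnum - seqnum_2
2020-04-01  DXB      1       1         0
2020-04-02  DXB      2       2         0
...
2020-04-07  DXB      7       7         0
2020-04-08  DXB CL   8       1         7
```

Within a run, both row numbers advance together, so the difference stays constant; when the symbol changes (even back to a value seen earlier), seqnum_2 restarts relative to seqnum and the difference jumps, opening a new island. A plain GROUP BY identifier, symbol would wrongly merge a symbol that reappears later.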

If you have this table and you want to apply the next day's load, then you can construct the next version of the table in its entirety using union all. The idea is to divide the processing into four steps:

  • Historical records which are already closed so the new data cannot affect them.
  • Records where the new data is consistent with the existing data, so nothing changes.
  • Records that are not in the new data so the existing record needs to be closed.
  • New records.

The SQL looks like:

-- all completed records
select ft.identifier, ft.symbol, ft.loaddate, ft.loadenddate
from finaltable ft
where ft.loadenddate is not null
union all
-- carry over open records where the new data is consistent
select ft.identifier, ft.symbol, ft.loaddate, ft.loadenddate
from finaltable ft join
     oneload ol
     on ft.identifier = ol.identifier and
        ft.symbol = ol.symbol
where ft.loadenddate is null
union all
-- close open records that are not in the new batch
select ft.identifier, ft.symbol, ft.loaddate, m.loaddate - interval '1 day'
from finaltable ft cross join
     (select max(loaddate) as loaddate
      from oneload
     ) m left join
     oneload ol
     on ft.identifier = ol.identifier and
        ft.symbol = ol.symbol
where ft.loadenddate is null and
      ol.identifier is null
union all
-- finally add new records
select ol.identifier, ol.symbol, ol.loaddate, null
from oneload ol left join
     finaltable ft
     on ft.identifier = ol.identifier and
        ft.symbol = ol.symbol and
        ft.loadenddate is null
where ft.identifier is null;

I prefer to do this as a select/replace operation rather than a series of insert/update steps (or perhaps a merge). But this is the basic idea.
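
One way to sketch that select/replace in Redshift is to materialize the rebuilt table and then swap it in by renaming. The intermediate table name is assumed, and the rename swap is used because Redshift's TRUNCATE commits immediately and so cannot be rolled back inside a transaction:

```sql
-- Build the next version from the four-part UNION ALL above
CREATE TABLE finaltable_next AS
SELECT ... ;   -- the full union-all query shown above

-- Swap the new table in, then drop the old one
BEGIN;
ALTER TABLE finaltable RENAME TO finaltable_old;
ALTER TABLE finaltable_next RENAME TO finaltable;
COMMIT;
DROP TABLE finaltable_old;
```

The swap keeps the destination table queryable under its usual name until the rebuilt version is ready, which suits the daily re-run in requirement 3.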