Change tracking/history keeping in SQL database using TSQL script in single table

682 views Asked by At

For context I have a simple Azure SQL database and I have a theoretical dimensional model with like 12 dimensions and some fact tables that eventually should function as a DWH.

I am looking to apply historization on this dimenional model. To start simple and apply the concept of historization I want to enable it on one dimension. In my case this is the DimEmployee which looks like this with a row of sample data (it has more columns but for simplicity sake I only take these)

EmpKey EmpName EmpCity DWHDatStart DWHDatEnd IsActive
52 John London 02/02/2020 31/12/9999 Y

Now if lets say John moves to a different city I want this to be tracked so the wanted situation when the value of EmpCity is updated would be

EmpKey EmpName EmpCity DWHDatStart DWHDatEnd IsActive
52 John Amsterdam 08/06/2021 31/12/9999 Y
52 John London 02/02/2020 08/06/2021 N

How would I go about applying this with TSQL. I assume it will have to be a trigger of some sorts but lack the specific knowledge to apply this. I've read about temporal tables and that usually when enabling historization on data you store the historical data in a separate history table. I think because of this models design there can always be expanded on this concept by adding DimEmployeeHist dimensions or whatever but for the first concept this is not needed. I don't know what value it would bring doing it in separate tables other then when you have lot's of history records so you want them all neatly stored in one place to not clutter up your main tables.

Does this have to be done in the create table script or can this be done after the table is created?

1

There are 1 answers

1
Nick.Mc On BEST ANSWER

Firstly: forget about triggers. Triggers are bad

This is a braindump on how I generally do this. There are many permutations on this but this should give you an idea.

This is a row in the target dimension. Of course there are many other rows

SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
3678 52 HRSystem1 John London 2020-02-20 9999-12-31 Y
3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
  • SurrogateKey is the key that joins to your fact and is unique in the dimension table (and should be enforced with a constraint or index)
  • SourceKey is the key in the source system
  • SourceSystem is a code for whatever system provided this record.
  • SourceKey + SourceSystem is unique (and should be enforced with a constraint or index)

We load data into our datawarehouse. Typically the first step is to load the input data into a staging table. So say we have this in a staging table:

EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
52 John Amsterdam NULL NULL
73 Jim Brisbane NULL NULL
7 Jack Texas NULL NULL

ActiveRecord and Ignore are working columns, they don't come from the source. Every other column does come from the source but knows nothing about the dimension

There will obviously be many records in the dimension and staging table.

First exclude all records that are up to date in the dimension.

UPDATE StagingTable
SET Ignore = 'Y'
FROM StagingTable TGT
INNER JOIN DimensionTable SRC
ON TGT.EmpKey = SRC.SourceKey
AND TGT.SourceKey = 'HRSystem1'
AND TGT.EmpName=SRC.EmpName 
AND TGT.EmpCity=SRC.EmpCity
AND SRC.IsActive = 'Y'

We've identified that Jim has not changed and can be ignored

EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
52 John Amsterdam NULL NULL
73 Jim Brisbane NULL Y
7 Jack Texas NULL NULL

identify all the records in the staging table that already have an active record in the dimension but have different attributes

UPDATE StagingTable
SET ActiveRecordSurrogateKey = SRC.SurrogateKey
FROM StagingTable TGT
INNER JOIN DimensionTable SRC
ON TGT.EmpKey = SRC.SourceKey
AND TGT.SourceKey = 'HRSystem1'
AND TGT.IsActive='Y'
AND (TGT.EmpName<>SRC.EmpName OR TGT.EmpCity<>SRC.EmpCity)

(These two seperate updates could be combined into one if you want.)

Now our staging table looks like this.

EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
52 John Amsterdam 3678 NULL
73 Jim Brisbane NULL Y
7 Jack Texas NULL NULL

Now we have enough info to update the dimension. We can write some more SQL to apply to the dimension based on the helper columns.

But first, Lets specify a fixed date. This stops weird things happening if this runs before and after midnight. Or you might want to determine this some other way, like an input parameter or data in the source

DECLARE @Date DATE = GETDATE();

Now we can insert all new records (brand new or changed)

-- This line inserts new dimension records:
INSERT INTO DimensionTable (SourceKey,SourceSystem,EmpName,EmpCity, StartDate,EndDate,IsActive)
SELECT EmpKey,'HRSystem1',EmpName,EmpCity, @Date,'2999-01-01','Y' 
FROM StagingTable 
WHERE Ignore IS NULL

Now our dimension looks like this

SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
3678 52 HRSystem1 John London 2020-02-20 9999-12-31 Y
3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
3693 7 HRSystem1 Jack Texas 2021-06-09 9999-12-31 Y
3694 52 HRSystem1 John Amsterdam 2021-06-09 9999-12-31 Y

Now we end date existing records:

-- This line end-dates existing records:
UPDATE DimensionTable
SET DWEndDate = @Date, Active = 'N'
FROM DimensionTable TGT
INNER JOIN StagingTable SRC
ON TGT.SurrogateKey  = ActiveRecordSurrogateKey
SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
3678 52 HRSystem1 John London 2020-02-20 2021-06-09 N
3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
3693 52 HRSystem1 Jack Texas 2021-06-09 9999-12-31 Y
3694 52 HRSystem1 John Amsterdam 2021-06-09 9999-12-31 Y

So you basically wrap all of those T-SQL statements up in one stored procedure, add some transactions, logging, and error handling.

CREATE PROC pUpdateDimPerson
AS
BEGIN

-- All the code above

END

The stored procedure compares source (staging) and target (dimension) and does all the right things.

There are a bunch of other things to consider but this gives you an idea.