For context I have a simple Azure SQL database and I have a theoretical dimensional model with like 12 dimensions and some fact tables that eventually should function as a DWH.
I am looking to apply historization on this dimenional model. To start simple and apply the concept of historization I want to enable it on one dimension. In my case this is the DimEmployee which looks like this with a row of sample data (it has more columns but for simplicity sake I only take these)
EmpKey | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|
52 | John | London | 02/02/2020 | 31/12/9999 | Y |
Now if lets say John moves to a different city I want this to be tracked so the wanted situation when the value of EmpCity is updated would be
EmpKey | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|
52 | John | Amsterdam | 08/06/2021 | 31/12/9999 | Y |
52 | John | London | 02/02/2020 | 08/06/2021 | N |
How would I go about applying this with TSQL. I assume it will have to be a trigger of some sorts but lack the specific knowledge to apply this. I've read about temporal tables and that usually when enabling historization on data you store the historical data in a separate history table. I think because of this models design there can always be expanded on this concept by adding DimEmployeeHist dimensions or whatever but for the first concept this is not needed. I don't know what value it would bring doing it in separate tables other then when you have lot's of history records so you want them all neatly stored in one place to not clutter up your main tables.
Does this have to be done in the create table script or can this be done after the table is created?
Firstly: forget about triggers. Triggers are bad
This is a braindump on how I generally do this. There are many permutations on this but this should give you an idea.
This is a row in the target dimension. Of course there are many other rows
SurrogateKey
is the key that joins to your fact and is unique in the dimension table (and should be enforced with a constraint or index)SourceKey
is the key in the source systemSourceSystem
is a code for whatever system provided this record.SourceKey
+SourceSystem
is unique (and should be enforced with a constraint or index)We load data into our datawarehouse. Typically the first step is to load the input data into a staging table. So say we have this in a staging table:
ActiveRecord
andIgnore
are working columns, they don't come from the source. Every other column does come from the source but knows nothing about the dimensionThere will obviously be many records in the dimension and staging table.
First exclude all records that are up to date in the dimension.
We've identified that Jim has not changed and can be ignored
identify all the records in the staging table that already have an active record in the dimension but have different attributes
(These two seperate updates could be combined into one if you want.)
Now our staging table looks like this.
Now we have enough info to update the dimension. We can write some more SQL to apply to the dimension based on the helper columns.
But first, Lets specify a fixed date. This stops weird things happening if this runs before and after midnight. Or you might want to determine this some other way, like an input parameter or data in the source
Now we can insert all new records (brand new or changed)
Now our dimension looks like this
Now we end date existing records:
So you basically wrap all of those T-SQL statements up in one stored procedure, add some transactions, logging, and error handling.
The stored procedure compares source (staging) and target (dimension) and does all the right things.
There are a bunch of other things to consider but this gives you an idea.