I am building specific Fact Table for a Kimball-based EDW.
While loading fact table I came upon the realization on the different approaches.
Below the steps taken:
The Source Data
Learner Absenteeism data.
Learners are sometimes absent and these absenteeism's are capture again a learner.
It can happen that the reason for absenteeism can change a few days later.
It can also happen that the record is completely deleted a few days later.
Source system has no form of Change Control features to indicate changed/deleted records.
Sourcing Data
When sourcing the data only the current year's data is sourced.
And no other data is kept in stage.
Lookup Surrogates
- Lookups are performed to replace natural keys with surrogate keys of the relating dimensions.
Load Data in Fact Table
- So now a clean conformed data set exist that need to to be loaded into the final Fact Table
Option One - Fresh Batch
DELETE FROM FactLearnerAbsenteeism WHERE SchoolYear = @CurrentYear
INSERT INTO FactLearnerAbsenteeism SELECT * FROM #Conform
Option Two - The Merge
MERGE dbo.FactLearnerAbsenteeism as DST
USING #Conform as SRC
*UPDATE CHANGES*
*INSERT NEW RECORDS*
My concern is: How I can i ensure records that no longer exist this year (deleted records) are removed. I prefer using the merge but using the DELETE option in MERGE can remove records from the previous years as they wont exist in the #Conform dataset.
Should I the rather source ALL data? Should I keep the previous years data in STAGE and only source the current years data?
I would propose that you add a technical column "IsValid" in your fact table, set by default to 0. Then your loading process could be as follows :
Step 1 UPDATE FactLearnerAbsenteism SET IsValid = 0 WHERE SchoolYear = @CurrentYear
Step 2 Perform the Merge, and set Isvalid to 1 for each updated or inserted record.
Step 3 Decide what you want to do with the invalidated records : either you delete them in a last pass, or you keep them as it can represent an interesting piece of information enabling you to track the deletions in the source systems.
Disadvantage is that analysts need to be aware of this isValid parameter in their queries, to avoid overestimating the absentees, but I tend to prefer this as it captures the maximum information.