I'm looking to load a dimension table that will contain a self-referencing key, and I want to understand what the most efficient / effective method would be.
Here's my setup: I have a Message table in my Staging environment that contains two keys, a primary key for the Message and a self-referencing foreign key for the first message in the thread. Simplified table structure with everything but keys taken out:
CREATE TABLE [dbo].[MsgMain](
[MsgMain_SK] [int] IDENTITY(1,1) NOT NULL,
[MsgMainPK1] [int] NULL,
[ThreadPK1] [int] NULL
) ON [PRIMARY]
Along with a clustered index on MsgMain_SK, there is a nonclustered index that includes both keys:
CREATE NONCLUSTERED INDEX [IX_MsgMain] ON [dbo].[MsgMain]
(
[MsgMainPK1] ASC,
[ThreadPK1] ASC
)
I'm loading data from that Message table into a Dimension table:
CREATE TABLE [dbo].[DimDiscussionPost](
[DiscussionPost_SK] [bigint] IDENTITY(1,1) NOT NULL,
[SrcDiscussionPostID] [int] NULL,
[ThreadStarter_SK] [int] NULL
) ON [PRIMARY]
(I have not yet created indexes on that table, but I plan to follow the same setup: clustered on the SK, non-clustered on the business and foreign keys).
My initial load process goes like this (will be converted to SSIS ETL package):
INSERT INTO [dbo].[DimDiscussionPost]
(
[SrcDiscussionPostID]
)
SELECT
MM.MsgMainPK1 AS SrcDiscussionPostID
FROM
GradebookSTG9.dbo.MsgMain MM
After loading the initial records, I then go back and run the following UPDATE
statement:
UPDATE dDP
SET dDP.ThreadStarter_SK = dTS.DiscussionPost_SK
FROM
GradeBookSTG9.dbo.MsgMain MM
INNER JOIN
MasterDM.dbo.DimDiscussionPost dDP ON
dDP.SrcDiscussionPostID = MM.MsgMainPK1
INNER JOIN
MasterDM.dbo.DimDiscussionPost dTS ON
MM.ThreadPK1 = dTS.SrcDiscussionPostID
Question One: is there a way to do both of these processes in one step?
Question Two: would the update be more efficient if I store the business key of ThreadPK1 in the table along with MsgMainPK1 (i.e., add SrcThreadStarterID)?
Question Three: after I convert this into an SSIS (ETL) package, would I be better off with:
- A trigger that will fire as each row is loaded (I'm using a bulk insert, if it matters)
- An Execute SQL task that runs the UPDATE statement above against all newly created / updated records
- A Lookup and Bulk Update data flow that will do the same logical UPDATE
- Something else unspeakably ingenious that you are about to suggest
Sample Data:
+-----------------+-----------+ | MsgMainPK1 | ThreadPK1 | | 1234 | 1234 | | 1235 | 1234 | | 1236 | 1234 | | 1237 | 1234 | | 1238 | 1234 | | 1239 | 1239 | | 1240 | 1240 | | 1241 | 1240 | | 1242 | 1234 | +-----------------+-----------+
Desired Data:
+-------------------+---------------------+------------------+--------------------+ | DiscussionPost_SK | SrcDiscussionPostID | ThreadStarter_SK | SrcThreadStarterID | +-------------------+---------------------+------------------+--------------------+ | 1 | 1234 | 1 | 1234 | | 2 | 1235 | 1 | 1234 | | 3 | 1236 | 1 | 1234 | | 4 | 1237 | 1 | 1234 | | 5 | 1238 | 1 | 1234 | | 6 | 1239 | 6 | 1239 | | 7 | 1240 | 7 | 1240 | | 8 | 1241 | 8 | 1240 | | 9 | 1242 | 1 | 1234 | +-------------------+---------------------+------------------+--------------------+
You can do it in one go.
I don't fully understand what you mean in third question. You should load the data into your staging table, then insert data into your dimension table. This process should be triggered by availability of the source data i.e. file has arrived or certain time of the day specified by your SLA. I don't know enough about your import process to answer that I'm afraid.