Load a dimension table with a self-referencing foreign key

1.2k views Asked by At

I'm looking to load a dimension table that will contain a self-referencing key, and I want to understand what the most efficient / effective method would be.

Here's my setup: I have a Message table in my Staging environment that contains two keys, a primary key for the Message and a self-referencing foreign key for the first message in the thread. Simplified table structure with everything but keys taken out:

CREATE TABLE [dbo].[MsgMain](
    [MsgMain_SK] [int] IDENTITY(1,1) NOT NULL,
    [MsgMainPK1] [int] NULL,
    [ThreadPK1] [int] NULL
) ON [PRIMARY]

Along with a clustered index on MsgMain_SK, there is a nonclustered index that includes both keys:

CREATE NONCLUSTERED INDEX [IX_MsgMain] ON [dbo].[MsgMain]
(
    [MsgMainPK1] ASC,
    [ThreadPK1] ASC
)

I'm loading data from that Message table into a Dimension table:

CREATE TABLE [dbo].[DimDiscussionPost](
    [DiscussionPost_SK] [bigint] IDENTITY(1,1) NOT NULL,
    [SrcDiscussionPostID] [int] NULL,
    [ThreadStarter_SK] [int] NULL
) ON [PRIMARY]

(I have not yet created indexes on that table, but I plan to follow the same setup: clustered on the SK, non-clustered on the business and foreign keys).

My initial load process goes like this (will be converted to SSIS ETL package):

INSERT INTO [dbo].[DimDiscussionPost]
  (
    [SrcDiscussionPostID]
  )
SELECT 
    MM.MsgMainPK1 AS SrcDiscussionPostID
FROM 
    GradebookSTG9.dbo.MsgMain MM 

After loading the initial records, I then go back and run the following UPDATE statement:

UPDATE dDP
SET dDP.ThreadStarter_SK = dTS.DiscussionPost_SK
FROM 
    GradeBookSTG9.dbo.MsgMain MM 
     INNER JOIN
    MasterDM.dbo.DimDiscussionPost dDP ON 
        dDP.SrcDiscussionPostID = MM.MsgMainPK1
     INNER JOIN 
    MasterDM.dbo.DimDiscussionPost dTS ON 
        MM.ThreadPK1 = dTS.SrcDiscussionPostID 

Question One: is there a way to do both of these processes in one step?

Question Two: would the update be more efficient if I store the business key of ThreadPK1 in the table along with MsgMainPK1 (i.e., add SrcThreadStarterID)?

Question Three: after I convert this into an SSIS (ETL) package, would I be better off with:

  1. A trigger that will fire as each row is loaded (I'm using a bulk insert, if it matters)
  2. An Execute SQL task that runs the UPDATE statement above against all newly created / updated records
  3. A Lookup and Bulk Update data flow that will do the same logical UPDATE
  4. Something else unspeakably ingenious that you are about to suggest

Sample Data:

+-----------------+-----------+
|     MsgMainPK1  | ThreadPK1 |
|     1234        | 1234      |
|     1235        | 1234      |
|     1236        | 1234      |
|     1237        | 1234      |
|     1238        | 1234      |
|     1239        | 1239      |
|     1240        | 1240      |
|     1241        | 1240      |
|     1242        | 1234      |
+-----------------+-----------+

Desired Data:

+-------------------+---------------------+------------------+--------------------+
| DiscussionPost_SK | SrcDiscussionPostID | ThreadStarter_SK | SrcThreadStarterID |
+-------------------+---------------------+------------------+--------------------+
|                 1 |                1234 |                1 |               1234 |
|                 2 |                1235 |                1 |               1234 |
|                 3 |                1236 |                1 |               1234 |
|                 4 |                1237 |                1 |               1234 |
|                 5 |                1238 |                1 |               1234 |
|                 6 |                1239 |                6 |               1239 |
|                 7 |                1240 |                7 |               1240 |
|                 8 |                1241 |                8 |               1240 |
|                 9 |                1242 |                1 |               1234 |
+-------------------+---------------------+------------------+--------------------+
1

There are 1 answers

1
TomT On

You can do it in one go.

WITH ordered_messages AS
(
  SELECT ThreadPK1, MsgMainPK1, 
    ROW_NUMBER() OVER(PARTITION BY ThreadPK1 ORDER BY MsgMainPK1) seq
  FROM MsgMain
)
INSERT INTO DimDiscussionPost(SrcDiscussionPostID, ThreadStarter_SK)
SELECT mm.MsgMainPK1, om.seq
FROM MsgMain mm
INNER JOIN ordered_messages om on om.ThreadPK1 = mm.ThreadPK1 and om.seq=1

I don't fully understand what you mean in third question. You should load the data into your staging table, then insert data into your dimension table. This process should be triggered by availability of the source data i.e. file has arrived or certain time of the day specified by your SLA. I don't know enough about your import process to answer that I'm afraid.