MS SQL Server: using CDC to populate single destination table from several source tables

2.5k views Asked by At

Can I use Change Data Capture in MS SQL Server (2008 or 2012) with the SSIS Package which joins several source tables into one destination table?

Technet articles describe CDC + SSIS usage cases when source table and destination table have the same structure. The only hint to the possibility of change tracking for custom data transformations is that it is possible to specify the columns for which CDC will track changes.

The problem is, I need to combine data from a number of source tables to get the destination table and then keep it in sync with those source tables.

This is because the data in the destination datawarehouse is normalized to lesser extent than in the source database. For example, I have Events table (containting Computer ID, Date/Time, and Event Description) and Computers table (containting Computer ID and Computer Name). I don't need those normalized tables and computer ids in the destination table, so the select to fill the destination tables should be:

INSERT INTO DestDB..ComputerEvents (ComputerName, DateTime, Event) 
SELECT s.ComputerName, e.DateTime, e.Event
FROM SourceDB..EventLog e
JOIN SourceDB..ComputerNames s
ON e.CompID = s.CompID

I just cannot figure out how to make CDC work with SSIS Package containing such transformation? Is it even possible?

3

There are 3 answers

0
sam On

OK first thing CDC captures changes in a table, so if there was some insert or delete or update in a table then a CDC record gets created with an indicator to say insert or update or delete and all the CDC task does is output records to one of the three outputs based on that indicator column so coming back to your question you might have to have multiple OLDEDB Sources and CDC Task for each Source and UNION ALL similar operations (insert,update, delete) together and then the Destination component or OLEDB Command component hope this helps :)

0
Mike Sofen On

Consider CDC as if it were your automated mechanism for filling staging tables (instead of a sql query, or replication), using one CDC source table pointed at one regular staging table. From there simply build your joined queries against the multiple staging tables as needed.

My assumption is, that you are pulling data from non-identical tables, like

  • an Order table,
  • an OrderDetail table, etc.

If you are pulling from several identical tables in the same or different dbs, then you can push the output of the CDC directly into the staging table and you're done.

0
Bill On

To answer the question: No, you can't.

As one other responder has pointed out, CDC can only tell you what changed in EACH source table since the last time you extracted changes.

Using CDC to extract changes from multiple source tables to load a single destination table is anything but simple.

Let's show why by means of an example. For this example I assume that a staging table is a table that is truncated routinely before being populated.

Suppose we have two source tables: Order, OrderDetail. We have one destination fact table FactOrder. FactOrder contains the OrderKey (from Order) and the sum of order amount from OrderDetail. A customer orders 3 products. One Order and 3 OrderDetail records are inserted into the source database tables. Our DW ETL extracts the 1 order record (insert) and 3 OrderDetail records (insert). If we chose to load changed records into staging tables as a previous responder said we could simply join our staging tables to create our FactOrder record. But, what happens if the we no longer carry one of the products and someone deletes a record from the OrderDetail record. The next DW ETL extracts 1 OrderDetail record (delete). How do we use this information to update the target table? Clearly we can't join from Order to OrderDetail because Order has no record for this particular OrderKey since it is a staging table that we just truncated. I chose a delete example but consider the same problem if dependent tables are updated.

What I propose instead is to extract the distinct set of primary key (OrderKey in our example) values for which there are changes in the any of the source tables required to build the FactOrder record and then extract the full FactOrder record in a subsequent request. For example, if 5 Order records are changed we know the 5 OrderKey values. If 30 OrderDetail records are changed we need to determine the distinct set of OrderKey values. Let's say that is 10 OrderKey. We then union the two sets. Let's say that there is overlap so that yields 12 OrderKey values. Now we seed our FactOrder extract query with the 12 OrderKey values. We get back 12 complete FactOrder records. We then use comparison of new to stored binary checksum to determine how to action the 12 records (insert or update). The above approach does not cover deletes from the Order table. Those would result in trivial deletes from FactOrder.

The many examples out there as you noted show how to use CDC to replicate/synchronize data from 1 source to 1 destination which isn't a typical data warehouse load use case since the tables in the data warehouse are typically denormalized (thus requiring joins among multiple source tables to build the destination row).