Excuse me, I'm a newbie (first question) and also new to SSIS 2008, so apologies if the question isn't constructed in an efficient way.
Context: I wish to extract all the usage records from a source database on server A, to a target database on server B, based on values in a reference table on target database (server B). Note: the source database is read only. If I was in a position to link servers, the SQL statement would look like below, the problem is, I am not (in a position to link servers). I also need to extract this data as part of a SSIS package.
My question: What SSIS functionality can I use to achieve the same result as the below SQL statement within a SSIS package. I understand that it is better for performance if the SQL statement is executed in the source database itself, TABLE 1 (reference table) contains 6 million rows, table 2 (customer usage table) contains 500 million rows.
The obvious solution might be to use a lookup transformation but I'm not sure if the 'BETWEEN' part of the statement is possible within a lookup, and as mentioned, due to performance issues I understand the ideal solution would be to pass the values to the SQL statement itself.
I understand this problem could be solved in SQL server 2012 with the use of parameters, however, this functionality is not available in 2008, and I have been advised to achieve the same with the use of a configuration file. I have looked into this, but haven't been able to figure out how to do it for two reasons:
- how to update the XML file based on values from the reference table,
- I don't know how to pass multiple rows to a variable to be used in the select statement
Any kind of advice about any of the above and how to achieve the desired outcome of efficiently inserting 500 million rows into target database table based on data from the reference table would be welcomed!
SELECT a.customer_id (unique primary key)
,a.start_timestamp
,a.volume
FROM ODBC_READONLY_SOURCE.CUSTOMER_USAGE A
INNER JOIN TARGET_DATEBASE.REFERENCE_TABLE B
ON a.customer_id = b.customer_id
WHERE a.start_time_stamp BETWEEN b.start_time_stamp AND b.end_time_stamp
TABLE 1: (SQL Server 2008 R2 standard edition database)
TARGET_DATEBASE.REFERENCE_TABLE
(customer_id VARCHAR(64)
,start_time_stamp (DATE)
,end_time_stamp (DATE))
unique primary key (customer_id);
TABLE 2: (DB2 9.7 read only ODBC connection)
ODBC_READONLY_SOURCE.CUSTOMER_USAGE
(usage_record_id (int)
,customer_id VARCHAR(64)
,start_time_stamp (DATE)
,volume (int))
unique primary key (usage_record_id);
The Best solution is to import your Source data as-is into a Staging table on Server B, and then execute a stored procedure that joins the Staging table to the Reference Table and inserts the desired results into the Target table.
If you must do this in the dataflow for some reason, a Lookup CAN solve the problem, but Lookups are best with small reference tables that can be cached.
In your case, if I had to do this in a dataflow I would use a Merge Join transformation. They are much faster with large reference tables.