Data streams in case of Merge

We are seeing enormous amounts of data traffic to and from our SSIS server, and we cannot find the culprit. Is there any way to find out which package is causing all the traffic? Any advice on that? We suspect that the merges we do may be causing it. Our SSIS machine gets data from several production SQL Servers and merges it with data in our warehouses. Does that mean that

  • a) new data is transferred to the SSIS machine,
  • b) existing data is transferred to the SSIS machine,
  • c) Merge is done and then all data is transferred to the warehouse?

How would you then go about limiting the amount of data moved to and from the SSIS machine?

MsSQLGirl On BEST ANSWER

The answer to your questions a, b and c (assuming you're using SSIS transformation components) is essentially “yes”: all new data and existing data required for the transformation will flow into the SSIS instance, and the resulting merged data will flow out of the SSIS instance to the target server. A more detailed explanation is below.

Assuming that you are using SQL Server 2012 or later, you can enable Verbose logging to capture the number of rows transferred. The details are captured in [catalog].[execution_data_statistics]. If you are looking for the size in bytes, you would need to calculate that from the row counts and the widths of the columns being extracted and transformed. [catalog].[execution_data_statistics] captures the package name, task name, data flow path, source and destination component names, the time of execution and the execution path, which is great for diagnosing.
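As a rough sketch of how those statistics could be used to find the heaviest packages, the query below sums the rows sent on each data flow path and ranks the paths by volume. The seven-day window and the @avg_row_bytes figure are assumptions made for illustration; replace the latter with a width worked out from your own column definitions.

```sql
USE SSISDB;

-- Assumed average row width in bytes. 200 is only a placeholder;
-- derive the real figure from the data types of the columns on each path.
DECLARE @avg_row_bytes bigint = 200;

SELECT TOP (20)
       eds.[package_name],
       eds.[task_name],
       eds.[dataflow_path_name],
       eds.[source_component_name],
       eds.[destination_component_name],
       SUM(eds.[rows_sent])                  AS total_rows,
       SUM(eds.[rows_sent]) * @avg_row_bytes AS estimated_bytes  -- rough estimate only
FROM   [catalog].[execution_data_statistics] AS eds
WHERE  eds.[created_time] >= DATEADD(DAY, -7, SYSDATETIMEOFFSET())  -- assumed window
GROUP BY eds.[package_name],
         eds.[task_name],
         eds.[dataflow_path_name],
         eds.[source_component_name],
         eds.[destination_component_name]
ORDER BY total_rows DESC;
```

The view is only populated for executions that ran with Verbose logging. For an execution started through T-SQL, that level can be requested by calling [catalog].[set_execution_parameter_value] with the LOGGING_LEVEL parameter set to 3 before the execution is started.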

SSIS is an in-memory pipeline. If you have 3 separate servers (Source, SSIS and Target), the amount of data traffic will vary with what the packages do. For example, if the Data Flow Tasks perform transformations using components such as Merge, Merge Join or Lookup, you can expect data to flow from the Source server into the SSIS server and from there to the Target server.

On the other hand, if you are running a simple Data Flow Task that uses a SQL Server Destination, and the source database, the target database and SSIS are all on the same server, SSIS will issue a BULK INSERT statement on that instance. In this case there will be very little data traffic across the network (at least none related to the BULK INSERT statement).

If your package contains an “Execute SQL Task” component that invokes T-SQL MERGE statements, this will not cause data traffic into or out of the SSIS server. The work is done on the SQL Server instance that the MERGE statement is executed on. If you are using Linked Servers, the data will flow into and out of the linked server as required by the MERGE statement, just as it would if you invoked the statement directly on the instance.
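To make the Execute SQL Task case concrete, here is a minimal sketch of such a MERGE. It assumes the new rows have already been loaded into a staging table on the warehouse instance; [staging].[Customer], [dbo].[DimCustomer], [CustomerId] and the column list are hypothetical names chosen for illustration.

```sql
-- Both tables live on the instance that runs the statement, so the
-- comparison of new and existing rows happens there, not on the SSIS server.
MERGE [dbo].[DimCustomer] AS tgt
USING [staging].[Customer] AS src
    ON tgt.[CustomerId] = src.[CustomerId]
-- Update only rows whose values actually changed.
-- Note: <> does not treat NULLs as different; add NULL handling if the columns are nullable.
WHEN MATCHED AND (tgt.[ColumnA] <> src.[ColumnA]
               OR tgt.[ColumnB] <> src.[ColumnB]) THEN
    UPDATE SET tgt.[ColumnA] = src.[ColumnA],
               tgt.[ColumnB] = src.[ColumnB]
-- Insert rows that exist in staging but not yet in the warehouse table.
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([CustomerId], [ColumnA], [ColumnB])
    VALUES (src.[CustomerId], src.[ColumnA], src.[ColumnB]);
```

With this pattern the existing warehouse rows never leave the warehouse instance. The new rows still have to reach the staging table, so any Data Flow Task that loads it will move that data through the SSIS server.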

My recommendation for limiting the amount of data moved is to be selective at the source level. For example, if you know that you are only going to be using ColumnA, ColumnB, ColumnC in dbo.Customer, then use

SELECT [ColumnA], [ColumnB], [ColumnC]
FROM [dbo].[Customer]  -- Better!

instead of the following statement which potentially can retrieve more than those 3 columns:

SELECT * 
FROM [dbo].[Customer]  -- Do Not Use
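The same idea can be applied to rows as well as columns. As a sketch, assuming the source table has a reliable last-modified column (the [ModifiedDate] column and the watermark value here are hypothetical), the source query could return only the rows changed since the previous load:

```sql
-- Hypothetical watermark. In a package this would normally come from an
-- SSIS variable mapped to a query parameter rather than a literal.
DECLARE @LastLoadTime datetime2 = '2015-01-01T00:00:00';

SELECT [ColumnA], [ColumnB], [ColumnC]
FROM [dbo].[Customer]
WHERE [ModifiedDate] > @LastLoadTime;  -- only rows changed since the last load
```

Whether this is safe depends on how the source system maintains that column, so treat it as an option to evaluate, not a drop-in change.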

There are also a number of best practices for optimizing SSIS that you can follow, including reducing bandwidth and the amount of data transferred. Please have a read here: http://blogs.msdn.com/b/sqlcat/archive/2013/09/16/top-10-sql-server-integration-services-best-practices.aspx. If you are working on a hybrid platform, you may also be interested in reading the "SSIS for Azure and Hybrid Data Movement" white paper (https://msdn.microsoft.com/en-us/library/jj901708.aspx). This white paper has an additional link to the "SSIS Operational and Tuning Guide", which would be useful as well.

In addition, you may be interested in having a look at the SSIS Reporting Pack, available on CodePlex, to get more visualization of SSIS executions on the server.

Hope this helps. Julie