Best way to process 1400 million records in SQL

152 views Asked by At

I humbly request you to help me with the below scenario, In my project I have to process time series data from a table. We are using Azure SQL Server.

The table dbo.batch_events has 1400 million rows, please see the below screenshot for the table structure and sample data:

Source table structure and sample data

I have to pivot the equipment name from the equipment_name column in the dbo.batch_events table into column names and load the pivoted value into dbo.time_series table .

If I pivot the equipment name column, I am getting 685 columns to be created in the dbo.time_series table.

Please see the below screenshot showing the target table (dbo.time_series) structure and expected output for the sample data shown in the source table.

Target table structure and sample data

Please kindly advise what is the best approach and method to write the query in SQL.

The query I wrote is taking 25 hours to process the 1400 million records and load them into the target table.

I have created day wise partition on time_stamp column for the source table (dbo.batch_events) and created two nonclustered indexes - one on equipment name and the other on time stamp column.

I humbly request you to advise me on the best approach to write the query for this scenario.

The stored procedure I have created process one month of data at a time; for one month, we have around 120 million rows to deal with.

  1. Used While loop to put entry for start_date and end_date for each month into dbo.Iteration_ctrl table, by getting the minimum and maximum date from the dbo.Batch_events table. So, I have 12 entries in this table, each entry representing one month.

  2. Used while loop to iterate through the 12 start_date and end_date entries in the dbo.Iteration_ctrl table, and used pivot query inside the while loop to load the data into dbo.Time_series table.

Please see the stored procedure I have written, which is taking 25 hours (which seems to me inefficient). Any help would be appreciated please.

DECLARE @MIN_TIME DATETIME, @MIN_TIMESTAMP  DATETIME;
DECLARE @MAX_TIME DATETIME, @MAX_TIMESTAMP  DATETIME;
DECLARE @DATE DATETIME, @ROWCOUNT INT, @TOTALCOUNT INT;

SELECT @MIN_TIME = MIN(Time_stamp)  
FROM [dbo].[BATCH_EVENTS] 

SELECT @MAX_TIME = MAX(Time_stamp)  
FROM [dbo].[BATCH_EVENTS] 

PRINT 'INSERT INTO  TABLE  [dbo].[ITERATION_CTRL] HAS STARTED ' + CAST(GETDATE() AS nvarchar(30)) 

WHILE @MIN_TIME < @MAX_TIME
BEGIN
    SELECT @DATE = DATEADD(MM, 01, @MIN_TIME)             
    SELECT @DATE = CASE WHEN @DATE > @MAX_TIME THEN @MAX_TIME ELSE @DATE END            

    INSERT INTO dbo.ITERATION_CTRL
        SELECT @MIN_TIME, @DATE

    PRINT 'INSERTION INTO TABLE  [dbo].[ITERATION_CTRL] HAS ENDED FOR'+ CAST(@MIN_TIME AS nvarchar(30)) + ' -' + CAST( @DATE AS nvarchar(30)) +' NO OF ROWS INSERTED :'+ CAST( @@ROWCOUNT AS nvarchar(30)) +' ' + CAST(GETDATE() AS nvarchar(30))      

    SELECT @MIN_TIME = DATEADD(SS, 01, @DATE)   
END

PRINT 'INSERT INTO  TABLE  [dbo].[Time_series_data] HAS STARTED ' + CAST(GETDATE() AS nvarchar(30))              

SELECT @TOTALCOUNT = COUNT(*) FROM dbo.ITERATION_CTRL 
SELECT @ROWCOUNT = 1

WHILE @ROWCOUNT <= @TOTALCOUNT
BEGIN
    SELECT 
        @MIN_TIMESTAMP = MIN_DATE, 
        @MAX_TIMESTAMP = MAX_DATE  
    FROM dbo.ITERATION_CTRL 
    WHERE ID = @ROWCOUNT

    BEGIN TRANSACTION
       INSERT INTO dbo.Time_series_data
           SELECT *
           FROM 
               (SELECT 
                    [Event_name], [Time_Stamp],
                    [Start_time], [End_time], [Duration],
                    [Value] AS [Sensor_Value],
                    Equipment_name
                FROM 
                    [dbo].[BATCH_EVENTS] BE
                WHERE 
                    Time_stamp >= [Start_time] AND Time_stamp <= [End_time]
                    AND Time_stamp BETWEEN @MIN_TIMESTAMP AND @MAX_TIMESTAMP) t
           PIVOT 
               (MAX([Sensor_Value])
                    FOR Equipment_Name IN ([MY1102], [MY1138], [MY1180],
                                           [MY1164], [MY1176], [MY204],
                                           [MY324], [MY64B6])
          ORDER BY 
              [Time_Stamp], [Event_name]
  
        COMMIT TRANSACTION

        SELECT @ROWCOUNT = @ROWCOUNT + 1

        --PRINT @MIN_TIMESTAMP, @MAX_TIMESTAMP
        PRINT 'INSERTION INTO TABLE   [Time_series_data] HAS ENDED NO OF ROWS INSERTED :'+ CAST( @@ROWCOUNT AS nvarchar(30)) +' For duration ' + CAST( @MIN_TIMESTAMP AS nvarchar(30))+ '   '+  CAST( @MAX_TIMESTAMP AS nvarchar(30))+' Time '+ CAST(GETDATE() AS nvarchar(30));
    END
END
0

There are 0 answers