I need to update a large table, roughly 3B rows. Each record needs to update 4 fields of decimal(18,2) datatype. I have recreated the table with the corrected data to update the original table. Below is the statement I have been testing. It works, but it takes about 30 minutes to update 6M records, which means updating the original table will take a very long time.
First, when I test this statement it starts off pretty fast but then slows down significantly as more records are updated. Why is this the case and what can I alter to speed up the process?
Second, I have been working on simplifying the joins and where clauses, but this was a pretty complex table to begin with. Based on the statement below, what else can I do to improve performance and shorten the update time? For example, I know it's standard to do batches of 10k rows, but is it worthwhile to try something larger, like 100k or even 1M at a time? Or would that cause some other unforeseen issue, like filling up the transaction log or running out of space in tempdb?
use Stage
go
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
go
DECLARE @batchId INT
DECLARE @batchSize INT
DECLARE @results INT
SET @results = 1
SET @batchSize = 10000
SET @batchId = 0 -- when 0 rows returned, exit the loop
Drop table If Exists #tempupdate
select
Identity(int,1,1) as rowkey,
a.HeaderId
,a.ClaimHeaderId
,a.ClaimPmtId
,a.ClaimServiceLineId
,a.ClaimLineInd
,a.ClaimAdjustGroupCode
,a.RemitDate
,b.totalclaimchargeamt as TotalClaimChargeAmt_d
,b.ClaimPaymentAmt as ClaimPaymentAmt_d
,b.LineitemChargeamt as lineitemchargeamt_d
,b.Amount as amount_d
,(a.amount - b.amount) as difff
into #tempupdate
from [stage].[stage].[DetailRefreshtest] a
inner join [RealMedstage].[reporting].[DetailRefresh] b on
a.HeaderId = b.HeaderId
and a.ClaimHeaderId = b.ClaimHeaderId
and a.ClaimPmtId = b.ClaimPmtId
and a.ClaimServiceLineId = b.ClaimServiceLineId
and a.ClaimLineInd = b.ClaimLineInd
and a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
where a.RemitDate between '2021-04-01 00:00:00.000' and '2021-04-02 00:00:00.000'
and b.RemitDate between '2021-04-01 00:00:00.000' and '2021-04-02 00:00:00.000'
and a.amount is not null and abs(a.amount - b.amount) <= 0.5
WHILE (@results > 0)
BEGIN
BEGIN TRY
BEGIN TRAN;
update [stage].[stage].[DetailRefreshtest]
set [TotalClaimChargeAmt] =b.TotalClaimChargeAmt_d
,[ClaimPaymentAmt] = b.ClaimPaymentAmt_d
,[LineItemChargeAmt] = b.lineitemchargeamt_d
,[Amount] = b.amount_d
,[opsinsertid]=0
from #tempupdate b
inner join [stage].[stage].[DetailRefreshtest] a on
a.HeaderId = b.HeaderId
and a.ClaimHeaderId = b.ClaimHeaderId
and a.ClaimPmtId = b.ClaimPmtId
and a.ClaimServiceLineId = b.ClaimServiceLineId
and a.ClaimLineInd = b.ClaimLineInd
and a.ClaimAdjustGroupCode = b.ClaimAdjustGroupCode
where a.amount is not null and abs(b.difff) <= 0.5 --(a.amount - b.amount_d)<=.5 and (a.amount - b.amount_d)>=-.5
and (b.rowkey >@batchId and b.rowkey<= @batchId + @batchSize)
SET @results = @@ROWCOUNT
-- next batch
SET @batchId = @batchId + @batchSize
COMMIT TRAN;
END TRY
BEGIN CATCH
SELECT
ERROR_NUMBER() AS ErrorNumber,
ERROR_SEVERITY() AS ErrorSeverity,
ERROR_STATE() AS ErrorState,
ERROR_PROCEDURE() AS ErrorProcedure,
ERROR_LINE() AS ErrorLine,
ERROR_MESSAGE() AS ErrorMessage;
IF @@TRANCOUNT > 0 ROLLBACK TRAN;
END CATCH
End
Let's say:
- the original table [stage].[stage].[DetailRefreshtest], which we'll call a, and #tempupdate, which we'll call b
- a has a primary key (or, more importantly, clustered index) called HeaderId. If your primary key is non-clustered, then replace every place I say 'primary key' below with 'clustered index'.
- b has no primary key but does have an auto-incrementing int called RowKey.

Every time you do a join between a and b, SQL Server will need to read the entirety of b because it has no way to know which rows are the relevant ones. To do 1 billion rows @ 10,000 rows per batch, you'll need to run 100,000 batches - and each one does a full read of b. This is slowing you down.

Instead, treat b as you would any other table with regards to indexing and query speed. Using the same approach you are using (e.g., batching etc.), you can reduce the extra scan time by:
- identifying the relevant batch of rows in b with a single seek
- joining b to a on a's primary key/clustered index - and preferably with b already sorted in the same way.

Step 1 - big improvement
Given that the filtering mechanism you're using is RowKey, you want your data sorted (indexed) on RowKey so it's easy to identify the relevant 10,000 rows you want.
You can either create the temp table first with a primary key/clustered index; or you could create it afterwards. To create it afterwards, use something like the following.
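A minimal sketch of that, assuming RowKey stays the IDENTITY column from the original script (the index name is just illustrative):

-- Add a clustered index on RowKey once #tempupdate has been populated,
-- so each batch's RowKey range is found with a single seek rather than a full scan.
CREATE UNIQUE CLUSTERED INDEX CIX_tempupdate_RowKey
    ON #tempupdate (RowKey);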
Step 2 - more improvement
It would be somewhat faster if you could ensure that RowKey is in the same order as the clustered index of a (e.g., HeaderId). This means that the number of writes to a is minimised (instead of jumping around everywhere to write the data, it is grouped together on the data pages).

You could ensure the sort by removing the IDENTITY on RowKey, making it NULL to start with (e.g., CAST(NULL AS int) AS RowKey), and then updating it as ROW_NUMBER() ordered by the clustered index of a, e.g. as sketched below. Note that it makes a lot of sense to do this before creating the clustered index above.
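One possible sketch of that, assuming HeaderId is (or leads) the clustered index of a; adjust the ORDER BY to match your actual clustered key:

-- In the SELECT ... INTO, replace the IDENTITY column with a nullable placeholder:
--     CAST(NULL AS int) AS rowkey,
-- then number the rows in the order of a's clustered index:
;WITH numbered AS
(
    SELECT rowkey,
           ROW_NUMBER() OVER (ORDER BY HeaderId) AS rn
    FROM #tempupdate
)
UPDATE numbered
SET rowkey = rn;
-- Only after rowkey is populated, create the clustered index from Step 1.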