When I run a following query:
merge into test_records t
using (
select id, "senior developer" title, country from test_records where country = 'Brazil'
) u
on t.id = u.id
when matched and (t.id <> u.id) then -- this is just to be sure that nothing will get updated
update set t.title = u.title, t.updated_at = now()
when not matched then
insert (id, title, country, created_at, updated_at) values (id, title, country, now(), now());
I still see the following data when I run describe history of the target table:
{"numTargetRowsCopied": "2", "numTargetRowsDeleted": "0", "numTargetFilesAdded": "1", "numTargetRowsInserted": "0", "numTargetRowsUpdated": "0", "numOutputRows": "2", "numSourceRows": "2", "numTargetFilesRemoved": "1"}
And in the spark ui i see this:
So the unmodified rows are being rewritten without any (?) reason. Why is that?
DISCLAIMER: I've been looking into it and I can only give you how the code looks, but not why it does so.
This no-match-row MERGE INTO is resolved to MergeIntoCommand logical command (that is executed on the driver). You can find all the performance metrics here (with numTargetRowsCopied).
That leads us to writeAllChanges with our metric
What's interesting about this code is that it chooses the join type to be
rightOuter
orfullOuter
. TurnDEBUG
logging level on fororg.apache.spark.sql.delta.commands.MergeIntoCommand
logger to see the internal messages printed out to the logs.Another very interesting thing is that the metrics are calculated as UDFs (!) Ours is here.
In the end, these lines are where the UDF gets executed and the metric incremented. The comment is of particular interest:
I think that says it all. No match leads to incrementing
numTargetRowsCopied
metric. I'm guessing that you've got two rows in the target table, don't you?