Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

1.2k views Asked by At

When I run a following query:

merge into test_records t
using (
select id, "senior developer" title, country from test_records where country = 'Brazil'
) u
on t.id = u.id
when matched and (t.id <> u.id) then -- this is just to be sure that nothing will get updated
  update set t.title = u.title, t.updated_at = now()
when not matched then 
  insert (id, title, country, created_at, updated_at) values (id, title, country, now(), now());

I still see the following data when I run describe history of the target table:

{"numTargetRowsCopied": "2", "numTargetRowsDeleted": "0", "numTargetFilesAdded": "1", "numTargetRowsInserted": "0", "numTargetRowsUpdated": "0", "numOutputRows": "2", "numSourceRows": "2", "numTargetFilesRemoved": "1"}

And in the spark ui i see this:

enter image description here

So the unmodified rows are being rewritten without any (?) reason. Why is that?

1

There are 1 answers

0
Jacek Laskowski On BEST ANSWER

DISCLAIMER: I've been looking into it and I can only give you how the code looks, but not why it does so.


This no-match-row MERGE INTO is resolved to MergeIntoCommand logical command (that is executed on the driver). You can find all the performance metrics here (with numTargetRowsCopied).

That leads us to writeAllChanges with our metric

What's interesting about this code is that it chooses the join type to be rightOuter or fullOuter. Turn DEBUG logging level on for org.apache.spark.sql.delta.commands.MergeIntoCommand logger to see the internal messages printed out to the logs.

Another very interesting thing is that the metrics are calculated as UDFs (!) Ours is here.

In the end, these lines are where the UDF gets executed and the metric incremented. The comment is of particular interest:

// Target row did not match any source row, so just copy it to the output

I think that says it all. No match leads to incrementing numTargetRowsCopied metric. I'm guessing that you've got two rows in the target table, don't you?