Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

Question

Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

1.2k views Asked by Jakub Troszok At 06 October 2020 at 09:37

When I run a following query:

merge into test_records t
using (
select id, "senior developer" title, country from test_records where country = 'Brazil'
) u
on t.id = u.id
when matched and (t.id <> u.id) then -- this is just to be sure that nothing will get updated
  update set t.title = u.title, t.updated_at = now()
when not matched then 
  insert (id, title, country, created_at, updated_at) values (id, title, country, now(), now());

I still see the following data when I run describe history of the target table:

{"numTargetRowsCopied": "2", "numTargetRowsDeleted": "0", "numTargetFilesAdded": "1", "numTargetRowsInserted": "0", "numTargetRowsUpdated": "0", "numOutputRows": "2", "numSourceRows": "2", "numTargetFilesRemoved": "1"}

And in the spark ui i see this:

So the unmodified rows are being rewritten without any (?) reason. Why is that?

Original Q&A

There are 1 answers

**Jacek Laskowski** · Accepted Answer · 2020-10-12T20:16:52+00:00

DISCLAIMER: I've been looking into it and I can only give you how the code looks, but not why it does so.

This no-match-row MERGE INTO is resolved to MergeIntoCommand logical command (that is executed on the driver). You can find all the performance metrics here (with numTargetRowsCopied).

That leads us to writeAllChanges with our metric

What's interesting about this code is that it chooses the join type to be rightOuter or fullOuter. Turn DEBUG logging level on for org.apache.spark.sql.delta.commands.MergeIntoCommand logger to see the internal messages printed out to the logs.

Another very interesting thing is that the metrics are calculated as UDFs (!) Ours is here.

In the end, these lines are where the UDF gets executed and the metric incremented. The comment is of particular interest:

// Target row did not match any source row, so just copy it to the output

I think that says it all. No match leads to incrementing numTargetRowsCopied metric. I'm guessing that you've got two rows in the target table, don't you?

TechQA.

Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

There are 1 answers

Related Questions in APACHE-SPARK-SQL

Related Questions in DATABRICKS

Related Questions in DELTA-LAKE

Popular Questions

Popular Tags

Trending Questions