Deduplicate Delta Lake Table


I have a Delta Lake table in Azure, and I'm using Databricks. When we add new entries, we use MERGE INTO to prevent duplicates from getting into the table. However, duplicates did get into the table, and I'm not sure how it happened. Maybe the MERGE INTO conditions weren't set up properly.

However it happened, the duplicates are there now. Is there any way to detect and remove them from the table? All the documentation I've found shows how to deduplicate the dataset before merging; nothing covers the case where the duplicates are already in the table. How can I remove them?
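
For reference, this is roughly how I'm checking for duplicates right now (a sketch; assume the table is called events and id is the merge key, since my real names differ):

```python
# Detect duplicate keys in a Delta table by grouping on the merge key.
# "events" and "id" are placeholders; substitute your table and key columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

dupes = (
    spark.table("events")
    .groupBy("id")
    .count()
    .filter("count > 1")
)
dupes.show()  # every row here is a key with more than one record
```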

Thanks


There are 3 answers

Arjoon On

If the duplicates already exist in the target table, your only options are:

  1. Delete the duplicated rows from the target table manually using SQL DELETE statements
  2. Create a deduplicated replica of your target table and rename both tables (deduplicated replica and original target) so that the replica becomes the main table (see the sketch below)
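
A minimal sketch of option 2, assuming a managed Delta table named events whose duplicates are exact full-row copies (so SELECT DISTINCT * is enough to deduplicate); adjust the names to your setup:

```python
# Build a deduplicated replica of "events", then swap the names so the
# replica becomes the main table. The table names here are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Create the deduplicated replica.
spark.sql("CREATE TABLE events_dedup AS SELECT DISTINCT * FROM events")

# 2. Keep the original around as a backup, and promote the replica.
spark.sql("ALTER TABLE events RENAME TO events_backup")
spark.sql("ALTER TABLE events_dedup RENAME TO events")
```

If the duplicates share a key but differ in other columns, replace SELECT DISTINCT * with logic that keeps one row per key.
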
Elisabetta On

I would suggest the following SOP:

  1. Fix the existing job (streaming or batch) so that it handles duplicates
  2. Change the job configuration to write into a _recovery table (and, for a streaming job, also change the checkpoint path to a _recovery location); a streaming sketch follows below
  3. Run the job and validate its output
  4. Rename the original folder to _backup and rename _recovery to the original name (do the same with the checkpoints directory)
  5. Restore the original job configuration.
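
As a rough illustration of steps 1-2 for a streaming job: the source table, paths, and the id key below are all assumptions, and the dropDuplicates call stands in for whatever fix step 1 actually needs:

```python
# Write the fixed stream into a _recovery table with a _recovery checkpoint.
# "events_source", "events_recovery", the checkpoint path, and "id" are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fixed_stream = (
    spark.readStream.table("events_source")  # the job's existing source
    .dropDuplicates(["id"])                  # step 1: handle duplicates in the job
)

query = (
    fixed_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_recovery")  # _recovery checkpoint
    .toTable("events_recovery")              # step 2: write into the _recovery table
)
```
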
Nikunj Kakadiya On

In order to remove the duplicates you can follow the approach below (a Python sketch follows the list):

  1. Create a separate table that is a replica of the table that has duplicate records.
  2. Drop the original table that has the duplicate records (both the metadata and the physical files).
  3. Write a Python or Scala job that reads the replica from step 1, removes the duplicate records (using the dropDuplicates function or any custom logic that defines a unique record), and recreates the table you dropped in step 2.
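
A minimal Python sketch of the three steps, assuming the table is named events and id defines a unique record; adapt the names and the dedup logic to your schema:

```python
# Copy, drop, and rebuild the table without duplicates.
# "events", "events_copy", and the "id" key column are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: create a replica of the table that has the duplicates.
spark.sql("CREATE TABLE events_copy AS SELECT * FROM events")

# Step 2: drop the original table (metadata plus physical files for a managed table).
spark.sql("DROP TABLE events")

# Step 3: rebuild the table from the replica, minus the duplicate records.
deduped = spark.table("events_copy").dropDuplicates(["id"])
deduped.write.format("delta").saveAsTable("events")
```
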

Once you follow the steps above, your table will no longer have duplicate rows. This is only a workaround to make the table consistent, though, not a permanent solution.

Before or after you follow the steps above, review your MERGE INTO statement to confirm it is written so that it cannot insert duplicate records. If the MERGE INTO statement is correct, make sure the dataset you are processing does not already contain duplicates from the source you are reading from.
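
As a sketch of what a safe merge can look like, using the Delta Lake Python API and assuming id is the unique key for both the events target and the updates source (the source batch is deduplicated first, since a batch containing the same new key twice would otherwise insert duplicates):

```python
# MERGE that avoids inserting duplicate keys: the source batch is
# deduplicated before merging. "events", "updates", and "id" are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "events")
source = spark.table("updates").dropDuplicates(["id"])  # dedupe the incoming batch

(
    target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```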