I use Spark 1.6 and am doing an inner join on two DataFrames as follows:
val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
.distinct()
But I keep getting duplicate values in the aggrgn_filter_group_id column. Can you please suggest a solution?
Spark < 2.0
Consider distinct on a dataset with the column(s) to drop duplicates on, followed by an inner join on the column(s). The price is to execute an extra select with distinct and join, but it should give you the expected result; see the sketch below.
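A minimal sketch of that approach, reusing the metric and filtercndtns DataFrames and the aggrgn_filter_group_id column from the question, and assuming the duplicates come from filtercndtns having more than one row per aggrgn_filter_group_id (de-duplicate whichever side actually carries the duplicates):

// Spark 1.6: keep only the distinct join keys from the filtering side,
// then inner-join metric against that de-duplicated key set.
val filterKeys = filtercndtns
  .select("aggrgn_filter_group_id")
  .distinct()

val filtergroup = metric
  .join(filterKeys, Seq("aggrgn_filter_group_id"), "inner")

If you also need other columns from filtercndtns in the result, run the distinct over the full list of columns you care about rather than just the join key.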
Spark >= 2.0

The following solution will only work with Spark 2.0+, which came out with support for the dropDuplicates operators and allows for dropping duplicates considering only a subset of columns.

Quoting the documentation: distinct or dropDuplicates simply drop the row duplicates comparing every column. If you're interested in a specific column, you should use one of the dropDuplicates variants that take column names; see the sketch below. When you specify a column or a set of columns, dropDuplicates returns a new Dataset with duplicate rows removed, considering only the subset of columns.
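A sketch of the Spark 2.0+ variant, again reusing the names from the question and assuming filtercndtns is the side carrying the duplicate keys:

// Spark 2.0+: dropDuplicates with a column subset keeps one row per
// aggrgn_filter_group_id, so the subsequent inner join cannot multiply
// rows on that key.
val filtergroup = metric
  .join(
    filtercndtns.dropDuplicates("aggrgn_filter_group_id"),
    Seq("aggrgn_filter_group_id"),
    "inner")

Note that dropDuplicates keeps an arbitrary one of the duplicate rows, so this only makes sense when you don't care which of the duplicated filtercndtns rows survives.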