I use Spark 1.6 and am doing an inner join on two DataFrames as follows:
val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
.distinct()
But I keep getting duplicate values in the aggrgn_filter_group_id column. Can you please suggest a solution?
Spark < 2.0
Consider distinct on a dataset with the column(s) to drop duplicates on, followed by an inner join on the column(s). The price is to execute an extra select with distinct and join, but it should give you the expected result; see the sketch below.
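A minimal sketch of that approach, reusing the metric and filtercndtns DataFrames and the aggrgn_filter_group_id column from the question, and assuming the duplicates come from filtercndtns having more than one row per aggrgn_filter_group_id (de-duplicate whichever side actually carries the duplicates):

// Spark 1.6: keep only the distinct join keys from the filtering side,
// then inner-join metric against that de-duplicated key set.
val filterKeys = filtercndtns
  .select("aggrgn_filter_group_id")
  .distinct()

val filtergroup = metric
  .join(filterKeys, Seq("aggrgn_filter_group_id"), "inner")

If you also need other columns from filtercndtns in the result, run the distinct over the full list of columns you care about rather than just the join key.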
Spark >= 2.0

The following solution will only work with Spark 2.0+, which came out with support for the dropDuplicates operators and allows for dropping duplicates considering only a subset of columns.

Quoting the documentation: distinct or dropDuplicates simply drop the row duplicates comparing every column. If you're interested in a specific column, you should use one of the dropDuplicates variants that take column names; see the sketch below. When you specify a column or a set of columns, dropDuplicates returns a new Dataset with duplicate rows removed, considering only the subset of columns.
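A sketch of the Spark 2.0+ variant, again reusing the names from the question and assuming filtercndtns is the side carrying the duplicate keys:

// Spark 2.0+: dropDuplicates with a column subset keeps one row per
// aggrgn_filter_group_id, so the subsequent inner join cannot multiply
// rows on that key.
val filtergroup = metric
  .join(
    filtercndtns.dropDuplicates("aggrgn_filter_group_id"),
    Seq("aggrgn_filter_group_id"),
    "inner")

Note that dropDuplicates keeps an arbitrary one of the duplicate rows, so this only makes sense when you don't care which of the duplicated filtercndtns rows survives.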