Google Dataprep/Trifacta - Joining three datasets, de-duping but maintaining not matched records

271 views Asked by At

This feels like it should be more straightforward than I am finding it.

I have three datasets/lists of customers with some overlapping data, but some customers who exist only in each of the lists.

List A: 500k records List B: 80k records List C: 70k records

The unique identifier/key for each is email address.

I have tried various ways of joining these - left outer, right outer, outer...

My latest attempt resulting in the largest list is to start with List A, then left joining the other two lists using the email address as the key.

But this way, I only seem to finish up with a joined list of 500k of the common records...

I feel as though it is dropping any new, unique users from List B and List C if they don't already exist in List A.

How can I join all three, achieving the de-duplication but also adding new rows for the uncommon users from B and C?

Thanks,

P

1

There are 1 answers

0
Atybzz On

You can solve this using the following recipe:

1) Since you have overlapping data across the three datasets, take a union of the columns which are common across the lists.

2) Since the union should produce some duplicate rows, you can then remove duplicate rows which would result in each email being present once.

3) Lastly, you can then do a left join against each list to add the columns which are unique to each list.