I have 2 users table in different databases and I would like to get only unique rows from one those tables.
On the following example, I need the list of emails which have no duplicate name.
I am using Pentaho DI Kettle.
Table Users from database 1
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Snow | [email protected]
--- | ----------- | -------------
2 | Sansa Stark | [email protected]
--- | ----------- | -------------
3 | Ayra Stark | [email protected]
Table Users from database 2
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Stott | [email protected]
--- | ----------- | -------------
2 | Jonh Jonh | [email protected]
--- | ----------- | -------------
3 | Ayra Stark | [email protected]
Desired Result
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Snow | [email protected]
--- | ----------- | -------------
2 | Sansa Stark | [email protected]
As far as I understand your question, you need to keep only the emails which are not duplicates in DB1 union DB2?
Well, follow your logic: get the data in (with one
Input table
by DB connection), count the number of records per emails (Memory Group by
) andFilter
out the emails with a count greater than 1.Use the
Memory Group by
, which do not requires sorting. In theGroup field
put the key:email
. And in theAggregates
put theNumber of rows
(in the Type drop down), and theFirst Value
(orLast Value
) ofName
otherwise this column will disappear from the stream.And
Add a sequence
if you need to create the ID on the output.