Remove duplicated row from 2 tables in different database

611 views Asked by At

I have 2 users table in different databases and I would like to get only unique rows from one those tables.

On the following example, I need the list of emails which have no duplicate name.

I am using Pentaho DI Kettle.

Table Users from database 1

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Snow   | [email protected]  
--- | ----------- | -------------
2   | Sansa Stark | [email protected]  
--- | ----------- | -------------
3   | Ayra Stark  | [email protected]  

Table Users from database 2

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Stott  | [email protected]  
--- | ----------- | -------------
2   | Jonh Jonh   | [email protected]  
--- | ----------- | -------------
3   | Ayra Stark  | [email protected]  

Desired Result

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Snow   | [email protected]  
--- | ----------- | -------------
2   | Sansa Stark | [email protected]
2

There are 2 answers

0
AlainD On

As far as I understand your question, you need to keep only the emails which are not duplicates in DB1 union DB2?

Well, follow your logic: get the data in (with one Input table by DB connection), count the number of records per emails (Memory Group by) and Filter out the emails with a count greater than 1.

Use the Memory Group by, which do not requires sorting. In the Group field put the key: email. And in the Aggregates put the Number of rows (in the Type drop down), and the First Value (or Last Value) of Name otherwise this column will disappear from the stream.

And Add a sequence if you need to create the ID on the output.

enter image description here

0
Smrat Srivastava On
  1. Perform an UNION ALL ( simply join the two files to an dummy)
  2. Perform an Sort on email.
  3. use UNIQUE rows on Name.
  4. use a stream lookup on Name having one input as table1 and second unique rows.
  5. Filter rows on id < 3 and id isnull.