Deleting records from the target table when the rows have been deleted from the source, using Pentaho Data Integration


I'm trying to delete records from the target table if they no longer exist in the source. I'm using PDI as the ETL tool. There are around 10M records in the source. Wouldn't it be time-consuming if the key lookup has to scan the entire table every day and delete a record whenever no match is found? How can this be achieved in the most efficient manner?

Could you please explain in detail, as I'm quite new to this tool?

1 Answer

Answered by Cyrus:

Yes, it will take a long time for a large amount of data. You can, however, speed things up considerably by taking advantage of the work the databases can do for you.

A typical way to do this is to create a transformation with two Table Input steps and a Merge rows (diff) step. That step takes two sorted input streams, matches rows on the keys, compares any other fields you specify, and adds a flag field to the output stream with the value "new", "deleted", "changed", or "identical". For your case, you only care about "deleted".

To check only the existence of records, you need just the primary/unique key columns, no other fields. This allows the source database to answer the query from the index alone, without touching the table itself. The same goes for reading the target table, even though you will obviously make changes there later.

Both Table Input steps should select the primary/unique key columns and order by them. You may have to sort the rows within Pentaho (with a Sort rows step) if the two databases are different and you have a non-numeric key, since string collation can differ between databases; here I will assume some kind of integer or bigint key.
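For example, assuming a single integer key column named id and tables named source_table and target_table (all placeholder names, adjust to your schema), the two Table Input queries could look like this:

    -- Table Input 1: target table (reference / old state)
    SELECT id FROM target_table ORDER BY id;

    -- Table Input 2: source table (compare / new state)
    SELECT id FROM source_table ORDER BY id;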

Connect the two streams to the Merge rows (diff) step. The target table is the reference (old state) and the source table is the compare (new state). Use the Get Fields buttons to fill in the key fields and compare fields; in this case both should just be your key columns.
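Conceptually, with only the key columns involved, Merge rows (diff) computes something like the following full outer join. This is just a sketch (in a SQL dialect that supports FULL OUTER JOIN, using the placeholder names from above); the step itself does this on the two streams inside Pentaho, not in the database:

    SELECT COALESCE(t.id, s.id) AS id,
           CASE
             WHEN s.id IS NULL THEN 'deleted'    -- key exists in target only
             WHEN t.id IS NULL THEN 'new'        -- key exists in source only
             ELSE 'identical'                    -- key exists on both sides
           END AS flagfield
    FROM target_table t
    FULL OUTER JOIN source_table s ON s.id = t.id;

Note that when only key columns are compared, matched rows come out as "identical"; the "changed" flag only appears when you give the step non-key compare fields.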

Add a Filter rows step with the condition flagfield = "deleted". Add two Dummy steps and connect them as the true and false outputs. Now run the transformation and verify that the rows going to the "true" output are indeed present in the target but no longer in the source.

Once you've tested this and confirmed it produces the correct rows, you will want a Delete step to remove those records from the target table. Doing the delete directly in the same transformation may cause a deadlock, since you would be reading from and deleting from the target table at the same time. Instead, add a Copy rows to result step, which passes the rows to the parent job.
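Configured with id as its key field, the Delete step then issues a statement roughly equivalent to this for each incoming row (again using the placeholder names from above):

    -- executed once per row, with ? bound to that row's id value
    DELETE FROM target_table WHERE id = ?;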

In the parent job, add a second transformation after the first. It only needs two steps: Get rows from result and Delete.