Continue stream based on result of a DB query

499 views Asked by At

I'm developing an ETL. The first step is a Text File Input, that adds to the stream some metadata from "Additional output fields", including filename and last modified.

I must make a query to DB verifying if that filename with that last modified datetime has already been processed. If so, the stream must stop and next steps must not be processed.

Is that possible? I've googled about it and found no example.

2

There are 2 answers

0
jfneis On

Pentaho processes all steps in parallel, so this kind of linear abstraction can be a little confusing.

What you need to do is to return no rows if don't want to continue your processing. If subsequent steps receive 0 rows, they will do nothing.

There are some ways to do this:

  1. Build your SQL query in a way that it just returns a row if there is NO FILE with the same name
  2. Use a Switch step + Detect Empty Stream + Filter rows to ensure that only when you have no result from your query a row will go to next steps
0
AlainD On

The issue you'll face very soon is: "What if I try to start my transformation when it is already running?". This sounds like an academic question until your crontab starts a deadlock or you users complain about you a double count on the data of three weeks ago.

You need to check this in a job. Contrary to transformation in which all steps run in parallel, the job entries run one after the other. You can use a Scripting/SQL entry and put a condition such as logdate<now()-10minutes like explained in this question.

Now, there a more robust architecture for your spec: put all the incoming file in a specific folder, process them then move them to an archive directory. In facts, move them in the archive just before processing, protecting from duplicate load.