Ideal way of debugging a complex NiFi dataflow


From what I understood after using NiFi to build some DB ingestion PoCs, the entire dataflow operates as a stream of flowfiles, and at any particular time execution can be at one or more processors simultaneously.

So I am really confused about how to debug a complex dataflow for failures.

My PoC workflow itself looks like this: [nifi-dataflow]

And when we move to production use cases, flows can get much more complicated than this. So I have a few questions.

  1. How do I know the status of the dataflow? If, say, 4 out of 10 forked flowfiles failed at GenerateTableFetch because of a database pool error, how do I know which ones failed, and how do I quickly replay them without going into data provenance and doing it one by one?

  2. Is there a way to know, just by looking at the dataflow, which flowfiles are failing at which processor?

I have a lot more doubts/confusions about debugging dataflows with NiFi, so if someone can point me to some docs or share best practices, that would be helpful.

Thanks.


There are 2 answers

Bryan Bende

Every processor should have one or more failure relationships. It is up to you to decide what to do with a failure: in some cases you can route a failure relationship back to the same processor to keep retrying; in other cases you could route it to a PutFile processor and write it out to local disk in order to inspect the contents; or you could route it to a PutEmail processor to email someone.

What you don't want to do is auto-terminate the failure relationship because then you are essentially saying you want to ignore it.
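A common way to implement the "route failure back and keep retrying" idea without looping forever is to count attempts in a flowfile attribute (in a real flow, an UpdateAttribute processor incrementing a counter plus a RouteOnAttribute deciding retry vs. give-up). This is a minimal Python illustration of that routing logic, not NiFi API code; the attribute name `retry.count` and the limit are hypothetical:

```python
# Sketch of the retry-count pattern. In NiFi this logic would live in
# UpdateAttribute (increment the counter) + RouteOnAttribute (pick a route);
# here it is condensed into one function for illustration.
MAX_RETRIES = 3  # hypothetical limit


def route_failed_flowfile(attributes: dict) -> str:
    """Return the relationship a failed flowfile should take."""
    retries = int(attributes.get("retry.count", "0"))
    if retries < MAX_RETRIES:
        attributes["retry.count"] = str(retries + 1)
        return "retry"    # loop back to the failing processor
    return "give_up"      # e.g. route to PutFile / PutEmail instead


ff = {"filename": "table_001.sql"}
print(route_failed_flowfile(ff))  # retry
print(ff["retry.count"])          # 1
```

The counter travels with the flowfile, so each processor in the loop sees how many attempts have already been made.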

Up_One

1 - How do I know the status of the dataflow? If, say, 4 out of 10 forked flowfiles failed at GenerateTableFetch because of a database pool error, how do I know which ones failed, and how do I quickly replay them without going into data provenance and doing it one by one?

You manage this by sending relationships of type failure (or others, depending on the type of processor you are using) to a Process Group that handles errors.

So, as Bryan mentioned, you don't want them to auto-terminate, unless you don't care.

2 - Is there a way to know, just by looking at the dataflow, which flowfiles are failing at which processor?

Yes - you have to set the "Bulletin level" on the processor, which dictates the level of log messages surfaced as bulletins on the canvas.

How do you manage NiFi flows that fail?

Well, you need to be best friends with the Bulletin Board. See the SiteToSiteStatusReportingTask, or you can use InvokeHttp against the native NiFi REST API with a GET call to http://nifi-server:port/nifi-api/flow/bulletin-board; this responds with a detailed JSON object which can be parsed and then pushed into a PutSlack/PutEmail/PutSNS processor for any error.
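The parsing step could look roughly like the sketch below. The exact response shape should be checked against your NiFi version's REST API docs; the structure used here (bulletins nested under `bulletinBoard.bulletins`, each with a `bulletin` object carrying `level`, `sourceName`, and `message`) is an approximation, and the sample payload is invented:

```python
import json

# Invented sample resembling a /nifi-api/flow/bulletin-board response;
# verify the real field names against your NiFi version's REST API docs.
sample_response = json.dumps({
    "bulletinBoard": {
        "bulletins": [
            {"id": 1, "bulletin": {"level": "ERROR",
                                   "sourceName": "GenerateTableFetch",
                                   "message": "Unable to obtain connection from pool"}},
            {"id": 2, "bulletin": {"level": "INFO",
                                   "sourceName": "PutDatabaseRecord",
                                   "message": "Committed 100 records"}},
        ]
    }
})


def extract_errors(body: str) -> list:
    """Pull (sourceName, message) pairs for ERROR-level bulletins."""
    data = json.loads(body)
    errors = []
    for entry in data.get("bulletinBoard", {}).get("bulletins", []):
        b = entry.get("bulletin", {})
        if b.get("level") == "ERROR":
            errors.append((b.get("sourceName"), b.get("message")))
    return errors


print(extract_errors(sample_response))
```

In a flow, the same filtering could be done with EvaluateJsonPath/SplitJson before routing errors to PutSlack/PutEmail/PutSNS.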

It is also ideal to have a shared Process Group that handles any incoming error flowfiles; this PG is built with rules and routes that apply to all the dataflow logic on your NiFi server. It is critical to have PG-specific attributes that are carried with all your flows and used further down the course of the dataflow.

eg:

Process Group "Demo" has a processor called Set PG Attributes that sets the PGName attribute, PGType attribute, FailEmailTitle attribute, etc. If my flow fails at any point, the failure relation will route my failed flowfile based on the value of one of the attributes set in the Set PG Attributes processor.
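The idea above can be simulated in a few lines. This is an illustration only, not NiFi code: `set_pg_attributes` stands in for the "Set PG Attributes" processor, `route_failure` for a RouteOnAttribute inside the shared error PG, and the attribute values and route names are hypothetical:

```python
def set_pg_attributes(flowfile: dict) -> dict:
    """Stand-in for the 'Set PG Attributes' processor at the start of PG 'Demo'."""
    flowfile.update({
        "PGName": "Demo",                        # hypothetical values
        "PGType": "ingest",
        "FailEmailTitle": "[Demo] flow failure",
    })
    return flowfile


def route_failure(flowfile: dict) -> str:
    """Stand-in for RouteOnAttribute inside the shared error Process Group."""
    if flowfile.get("PGType") == "ingest":
        return "notify_dba"   # hypothetical route for ingestion failures
    return "notify_ops"       # hypothetical default route


ff = set_pg_attributes({"filename": "orders.csv"})
print(route_failure(ff))  # notify_dba
```

Because the attributes are set once at the top of the PG and carried with every flowfile, a single shared error PG can branch correctly no matter which flow the failure came from.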

Here is a diagram of my current setup, where I have all the failures sent to the same shared PG. [diagram]

Another option

If you think the bulletins only persisting for 5 minutes is an issue, then you can use nifi-app.log, whose contents are controlled by the rules in your /opt/nifi/conf/logback.xml file:

  <logger name="org.apache.nifi" level="ERROR"/>
  <logger name="org.apache.nifi.processors" level="DEBUG"/>
  <logger name="org.apache.nifi.processors.standard.LogAttribute" level="ERROR"/>
  <logger name="org.apache.nifi.processors.standard.LogMessage" level="ERROR"/>
  <logger name="org.apache.nifi.controller.repository.StandardProcessSession" level="ERROR"/>

You can then have a TailFile processor watching your local log file that grabs error information, or whatever else you think is of use to you, and makes some sense out of it.
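The "make sense out of it" step after TailFile is typically just line parsing. A minimal sketch, assuming a timestamp/level/thread/logger layout close to logback's default pattern (the sample lines are invented, so treat the regex as a starting point to adapt to your actual pattern):

```python
import re

# Invented nifi-app.log lines in a logback-like layout:
#   <date> <time,ms> <LEVEL> [<thread>] <logger> <message>
sample_lines = [
    "2023-05-01 10:15:02,123 INFO [Timer-Driven Process Thread-4] "
    "o.a.n.processors.standard.GenerateTableFetch Fetched 10 partitions",
    "2023-05-01 10:15:03,456 ERROR [Timer-Driven Process Thread-2] "
    "o.a.n.processors.standard.GenerateTableFetch Unable to obtain connection",
]

# Match only ERROR lines; capture timestamp, thread, logger, and message.
LOG_RE = re.compile(r"^(\S+ \S+) ERROR \[([^\]]+)\] (\S+) (.*)$")


def find_errors(lines):
    """Return (timestamp, logger, message) tuples for ERROR lines."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ts, _thread, logger, msg = m.groups()
            hits.append((ts, logger, msg))
    return hits


print(find_errors(sample_lines))
```

In a flow, the equivalent would be TailFile followed by RouteText or ExtractText to keep only the ERROR lines before alerting.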