NiFi-1.0 - content_repo & flowfile_repo


I have a fairly large flow that takes a CSV file and eventually converts it to SQL statements (via Avro and JSON). For a 5 GB file, flowfile_repo grew to 24 GB during processing and content_repo to 18 GB.

  • content_repo max 18 GB
  • flowfile_repo max 26 GB

Is there a way to predict how much space I would need to process N files? Why does it take so much space?

Answer by Bryan Bende (best answer)

The flow file repo is checkpointed every 2 minutes by default, and it stores the state and the attributes of every flow file. Its size therefore depends on how many flow files, and how many attributes per flow file, are written during that 2-minute window, as well as on how many processors the flow files pass through and how many of those processors modify the attributes.

The content repo stores content claims, where each content claim contains the content of one or more flow files. A cleanup thread runs periodically and determines whether a content claim can be cleaned up. The decision depends on whether you have archiving enabled. If archiving is disabled, a content claim can be cleaned up as soon as no active flow files reference any of the content in that claim.
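As a rough illustration of the cleanup rule above, here is a toy Python model. It is not NiFi code: `ContentClaim` and `reclaimable_bytes` are hypothetical names, and the behaviour with archiving enabled (nothing is freed immediately) is an assumption made for the sketch, since the rule above only spells out the archiving-disabled case.

```python
from dataclasses import dataclass, field


@dataclass
class ContentClaim:
    """Toy stand-in for a content claim holding content for one or more flow files."""
    size_bytes: int
    # IDs of the active flow files that still reference content in this claim.
    referencing_flowfiles: set = field(default_factory=set)


def reclaimable_bytes(claims, archiving_enabled):
    """Bytes a cleanup pass could delete right away in this simplified model."""
    if archiving_enabled:
        # Assumption for this sketch: archived claims are kept for a while,
        # so nothing is freed immediately.
        return 0
    # Archiving disabled: a claim is removable once no active flow file references it.
    return sum(c.size_bytes for c in claims if not c.referencing_flowfiles)


claims = [
    ContentClaim(size_bytes=100, referencing_flowfiles={"flowfile-a"}),
    ContentClaim(size_bytes=50),  # no active references left
]
print(reclaimable_bytes(claims, archiving_enabled=False))
print(reclaimable_bytes(claims, archiving_enabled=True))
```

The point of the model is that a claim shared by several flow files stays on disk until the last of them stops referencing it, which is one reason the repo can be much larger than the data currently "in flight".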

The flow file content also follows a copy-on-write pattern, meaning the content is immutable, and when a processor modifies the content it actually writes a new copy. So if you had a 5 GB flow file and it passed through a processor that modifies the content, such as ReplaceText, it would write another 5 GB to the content repo. The original could then be removed according to the logic above, depending on archiving and on whether any flow files still reference that content.
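To make the copy-on-write arithmetic concrete, here is a back-of-the-envelope estimator in Python. It is only a sketch: `peak_content_bytes` is a hypothetical helper, and the per-processor size ratios are values you would have to measure for your own flow (for example, how much larger the JSON is than the CSV). It ignores splits, merges, and several files being in flight at once.

```python
def peak_content_bytes(input_bytes, output_ratios, originals_retained=True):
    """Estimate peak content repo usage for one file.

    output_ratios: for each content-modifying processor, in order,
    the ratio of its output size to its input size (assumed values).
    """
    sizes = [input_bytes]
    for ratio in output_ratios:
        # Copy-on-write: every modification writes a whole new copy.
        sizes.append(sizes[-1] * ratio)
    if originals_retained or len(sizes) == 1:
        # Worst case: no earlier copy has been cleaned up yet.
        return sum(sizes)
    # Best case: only the input and output of a single step coexist.
    return max(before + after for before, after in zip(sizes, sizes[1:]))


# A 5 GB file through one processor that rewrites the content at the same size.
print(peak_content_bytes(5, [1.0]))
# The same file through two such processors, with and without prompt cleanup.
print(peak_content_bytes(5, [1.0, 1.0]))
print(peak_content_bytes(5, [1.0, 1.0], originals_retained=False))
```

Under these assumptions, each content-modifying step adds roughly another copy until the cleanup thread catches up, so a chain of conversions (CSV to Avro to JSON to SQL) can plausibly multiply the on-disk footprint of a single input file several times over.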

If you are interested in more detail, there is an in-depth document about how all of this works here:

https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html