I need to process many PDF files. I have a list of files (from a folder or a zip file) and want one subtask per PDF. Each of those then creates one subtask per page, so the pages can be processed.
I was thinking of using a fork/join pool, but that just keeps creating more subtasks to read more files, and I run out of memory.
Sometimes I get many small files, sometimes I get large files with many pages. It makes no sense to load more documents when there are already many pages queued up to be processed.
1. Each PDF file from a folder is read and a subtask (2) is created, forked, and joined.
2. For each page a subtask (3) is created, forked, and joined.
3. Process this page.
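Roughly, the three levels above look like the following sketch. All class names are made up for illustration, and a "document" is just a list of page strings standing in for a real PDF, so no PDF library is involved. "Processing" a page only counts its characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class PdfBatchSketch {

    // Level 3: process one page (stand-in work: count its characters).
    static class PageTask extends RecursiveTask<Integer> {
        private final String page;
        PageTask(String page) { this.page = page; }
        @Override protected Integer compute() { return page.length(); }
    }

    // Level 2: one task per document; forks one PageTask per page and joins them.
    static class DocumentTask extends RecursiveTask<Integer> {
        private final List<String> pages;
        DocumentTask(List<String> pages) { this.pages = pages; }
        @Override protected Integer compute() {
            List<PageTask> tasks = new ArrayList<>();
            for (String page : pages) tasks.add(new PageTask(page));
            int total = 0;
            for (PageTask t : invokeAll(tasks)) total += t.join();
            return total;
        }
    }

    // Level 1: one task per folder; forks one DocumentTask per document.
    // Nothing here limits how many documents are "loaded" at once.
    static class FolderTask extends RecursiveTask<Integer> {
        private final List<List<String>> documents;
        FolderTask(List<List<String>> documents) { this.documents = documents; }
        @Override protected Integer compute() {
            List<DocumentTask> tasks = new ArrayList<>();
            for (List<String> doc : documents) tasks.add(new DocumentTask(doc));
            int total = 0;
            for (DocumentTask t : invokeAll(tasks)) total += t.join();
            return total;
        }
    }

    // Runs the whole hierarchy in a dedicated pool and returns the combined result.
    static int process(List<List<String>> folder) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new FolderTask(folder));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real tool, creating a DocumentTask would mean reading a file into memory, which is where the memory problem comes from.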
There's ForkJoinTask.helpQuiesce(), which might be good enough in some situations. I can just call ForkJoinTask.helpQuiesce() after creating some subtasks. This way the subtasks are more likely to be processed before more data is loaded.
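A minimal sketch of that idea, assuming a loader task that forks page tasks and pauses after every few documents. The names LoaderTask, PageTask and docsPerBatch are hypothetical, and documents are again plain lists of strings. Since the documentation only says helpQuiesce() "possibly" executes tasks until the pool is quiescent, the sketch also joins the forked tasks as a safety net.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

public class HelpQuiesceSketch {

    // Stand-in for the per-page work: only counts that the page was handled.
    static class PageTask extends RecursiveAction {
        private final String page; // real work on the page would use this
        private final AtomicInteger processed;
        PageTask(String page, AtomicInteger processed) {
            this.page = page;
            this.processed = processed;
        }
        @Override protected void compute() { processed.incrementAndGet(); }
    }

    // "Loads" documents one by one and stops loading after every
    // docsPerBatch documents until the forked page tasks have run.
    static class LoaderTask extends RecursiveAction {
        private final List<List<String>> documents;
        private final int docsPerBatch;
        private final AtomicInteger processed;
        LoaderTask(List<List<String>> documents, int docsPerBatch, AtomicInteger processed) {
            this.documents = documents;
            this.docsPerBatch = docsPerBatch;
            this.processed = processed;
        }
        @Override protected void compute() {
            List<PageTask> pending = new ArrayList<>();
            int loaded = 0;
            for (List<String> doc : documents) {
                for (String page : doc) {
                    PageTask task = new PageTask(page, processed);
                    task.fork();
                    pending.add(task);
                }
                if (++loaded % docsPerBatch == 0) drain(pending);
            }
            drain(pending); // handle the last, possibly partial batch
        }
        private static void drain(List<PageTask> pending) {
            ForkJoinTask.helpQuiesce();            // help run queued tasks first
            for (PageTask t : pending) t.join();   // safety net: make sure they are done
            pending.clear();
        }
    }

    // Runs the loader in a dedicated pool and returns the number of processed pages.
    static int run(List<List<String>> documents, int docsPerBatch) {
        AtomicInteger processed = new AtomicInteger();
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new LoaderTask(documents, docsPerBatch, processed));
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }
}
```

This only throttles loading in fixed batches. It does not give page tasks a real priority over loading tasks, which is what I am actually after.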
But I can't find anything to set the priority of a subtask. Wouldn't that be a lot easier? If I understand the documentation correctly, there is one submission queue and then one task queue per worker thread. Is there no way to control which tasks from the submission queue are processed first? I can pass a factory for the worker threads, but not for the submission queue.
Like in the divide-and-conquer metaphor: it might make more sense to plunder all cities before you invade a new country or even a new continent, so that you have the resources those bigger tasks need. But how is this controlled?
I know Fork/Join uses work stealing and you usually don't have to bother. But I need to build a batch processing tool, and I can't have it load gigabytes of data into memory before it even begins processing any of the pages. I also don't need a framework like Hadoop for a bunch of PDF files. That would be overkill.
I could use a PriorityQueue<E>, but that seems to be a lot more work as this is only a simple data structure, while Fork/Join is a framework.
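To illustrate what the priority queue route might look like, here is a sketch that combines a ThreadPoolExecutor with a PriorityBlockingQueue (the thread-safe counterpart of PriorityQueue) instead of Fork/Join. The PrioritizedTask class and the PAGE/LOAD constants are hypothetical. Tasks have to be passed to execute(), because submit() wraps them in a FutureTask that is not Comparable. A single worker thread is used only to make the ordering visible.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PriorityExecutorSketch {

    static final int PAGE = 0; // lower value is taken from the queue first
    static final int LOAD = 1; // loading waits while page tasks are queued

    // Hypothetical wrapper that makes a Runnable comparable by priority.
    static class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
        final int priority;
        final Runnable body;
        PrioritizedTask(int priority, Runnable body) {
            this.priority = priority;
            this.body = body;
        }
        @Override public void run() { body.run(); }
        @Override public int compareTo(PrioritizedTask other) {
            return Integer.compare(priority, other.priority);
        }
    }

    // Queues a mix of load and page tasks and returns the order in which they ran.
    static List<String> demo() {
        List<String> order = new CopyOnWriteArrayList<>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<Runnable>());
        CountDownLatch gate = new CountDownLatch(1);
        try {
            // Keep the only worker busy so the following tasks pile up in the queue.
            pool.execute(new PrioritizedTask(PAGE, () -> {
                try {
                    gate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            pool.execute(new PrioritizedTask(LOAD, () -> order.add("load-A")));
            pool.execute(new PrioritizedTask(PAGE, () -> order.add("page-1")));
            pool.execute(new PrioritizedTask(LOAD, () -> order.add("load-B")));
            pool.execute(new PrioritizedTask(PAGE, () -> order.add("page-2")));
            gate.countDown(); // release the worker; the queue decides what runs next
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // no-op if the pool already terminated
        }
        return order;
    }
}
```

Both page tasks run before either load task, although the order among tasks of equal priority is not guaranteed. The downside is that forking, joining and work stealing would all have to be rebuilt by hand.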
Is there no way of controlling the order in which tasks are processed? What am I missing? Is there some other priority queue based solution available in Java?