How do I write a batch processing command using GNU parallel?


I'm trying to do some batch processing using a package called ocrmypdf.

Here is a command that processes a single PDF file:

ocrmypdf input.pdf output.pdf

And here is a command that processes every PDF file in the directory it is run in:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

What I actually want is to run the following command, which takes one more parameter, for every PDF file in the directory:

ocrmypdf --sidecar txt/input.txt input.pdf out/output.pdf

I tried adapting the earlier parallel command like this:

parallel --tag -j 2 ocrmypdf --sidecar txt/{}.txt {}.pdf out/{}.pdf ::: *.pdf

But I get the error:

ocrmypdf: error: the following arguments are required: output_pdf

Can someone help me understand what I'm doing wrong? Thanks!


There are 2 answers

John Collins

Try:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

The ".pdf" suffixes after the curly brackets (i.e. "{}.pdf") are extraneous: "{}" is replaced by the whole input argument, extension included, so "{}.pdf" becomes "input.pdf.pdf" and ocrmypdf cannot find the input file. For the sidecar text file, putting a period inside the brackets ("{.}") strips the extension, so you end up with "....txt" instead of "....pdf.txt" files (where "..." is the same file name as the input).
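To make the difference concrete, here is a minimal sketch that imitates the two replacement strings with plain shell parameter expansion. It does not call parallel or ocrmypdf; expand_job is a hypothetical helper that only prints the command line one job would get, and the `${f%.*}` stand-in matches "{.}" for simple names like these.

```shell
# Sketch of GNU parallel's {} and {.} replacement strings using plain
# shell parameter expansion. expand_job is a hypothetical helper that only
# prints the command; it does not run ocrmypdf.
expand_job() {
    f=$1              # what {} stands for: the full argument, e.g. input.pdf
    stem=${f%.*}      # what {.} stands for: the argument minus its last extension
    printf 'ocrmypdf --sidecar txt/%s.txt %s out/%s\n' "$stem" "$f" "$f"
}

# Corrected form: txt/{.}.txt {} out/{}
expand_job input.pdf

# The question's form appended ".pdf" to {}, which doubles the extension:
printf 'ocrmypdf --sidecar txt/%s.txt %s.pdf out/%s.pdf\n' input.pdf input.pdf input.pdf
```

The first line printed is the intended command; the second shows input.pdf.pdf appearing as the input file. If GNU parallel is installed, adding --dry-run to the real command should print the generated command lines in the same way without running them.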

If the above doesn't work, the likely cause is file names containing whitespace or other characters that interfere with parallel's parsing (quote characters, parentheses, etc.). In that case, try using a file as the input instead:

Troubleshooting Solution - Create a File as the Input to parallel

I believe this should work. To avoid the fuss with quotes, first create a file with the names of all the PDFs (relative paths from the current working directory). Here [g]ls means ls, or gls if that is what GNU ls is called on your system:

[g]ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list

or

[g]ls --color=none -N *.pdf > ocrmypdf.list

The important thing is that no single quotes are added around the file names in the .list file; each name should appear literally. Like this:

Tritone Substitution sheet music.pdf

not like this:

'Tritone Substitution sheet music.pdf'

Then you can run the parallel ocrmypdf command, like so:

parallel -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} :::: ocrmypdf.list

Also note the four colons (::::) instead of the usual three (:::), because parallel is reading its arguments from a file. Each line of the file becomes one complete argument, so spaces and similar characters in the PDF file names are not a problem.
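As an alternative to ls for building the list, the shell's own printf builtin writes one name per line with no added quotes or color codes. The sketch below is only an illustration: the scratch directory and the sample file names are made up, and the loop merely shows that each line is read back as one whole name, which is how the file is treated as input.

```shell
# Sketch: build the list with the shell's printf builtin instead of ls,
# so no quoting or color codes can leak into the file. The scratch
# directory and sample names are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'Tritone Substitution sheet music.pdf' 'plain.pdf'

# One file name per line, exactly as it is on disk.
printf '%s\n' *.pdf > ocrmypdf.list

# Read the list back one line at a time; each line is one whole argument.
while IFS= read -r name; do
    printf 'would process: [%s]\n' "$name"
done < ocrmypdf.list
```

The brackets in the output make it easy to see that a name containing spaces stays in one piece. A line-based list like this cannot represent file names that themselves contain newlines.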

Ole Tange

This works for me:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

If it does not work for you:

  • Identify a failing file
  • Run ocrmypdf on the failing file by hand to check whether it works
  • Edit your question to include a link to the failing file
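One possible way to do the first step is to run the files one at a time and print the ones whose command exits with an error. This is only a sketch: find_failing and ocr_one are hypothetical helper names, and ocr_one assumes the txt/ and out/ directories already exist.

```shell
# Sketch: run a command on each file in turn and print the files for which
# it fails. find_failing and ocr_one are hypothetical helper names.
find_failing() {
    cmd=$1
    shift
    for f in "$@"; do
        # Discard the command's own output; only the exit status matters here.
        "$cmd" "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}

# Wrapper that passes the same arguments as the parallel command above:
# sidecar without the .pdf extension, then input, then output.
ocr_one() {
    ocrmypdf --sidecar "txt/${1%.*}.txt" "$1" "out/$1"
}

# Usage (commented out so the sketch itself has no side effects):
# find_failing ocr_one *.pdf
```

Any file name this prints can then be run by hand, without the output redirection, to see the actual error message.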

(Also be aware of this bug when running multiple tesseracts: https://github.com/tesseract-ocr/tesseract/issues/3109#issuecomment-703845274)