I'm trying to do some batch processing using a package called ocrmypdf.
Here is a command that processes one PDF file:
ocrmypdf input.pdf output.pdf
and here is a command that processes all PDF files in the directory we run it in:
parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
What I actually want is to run the following command, which takes one more parameter, for every PDF file in the directory:
ocrmypdf --sidecar txt/input.txt input.pdf out/output.pdf
I tried rewriting the earlier parallel command like this:
parallel --tag -j 2 ocrmypdf --sidecar txt/{}.txt {}.pdf out/{}.pdf ::: *.pdf
But I get the error:
ocrmypdf: error: the following arguments are required: output_pdf
Can someone help me understand what I'm doing wrong? Thanks!
Try:
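A sketch of the corrected command, based on the explanation that follows: "{}" is the full input filename including its extension, and "{.}" is the same name with the extension removed. The mkdir line and the optional --dry-run preview are additions, on the assumption that the txt and out directories may not exist yet.

```shell
# Assumption: create the target directories in case they do not exist yet.
mkdir -p txt out

# Optional preview: --dry-run prints each generated command instead of running it.
parallel --dry-run ocrmypdf --sidecar 'txt/{.}.txt' '{}' 'out/{}' ::: *.pdf

# Actual run, two jobs at a time:
#   {}  -> input filename with extension
#   {.} -> input filename without extension
parallel --tag -j 2 ocrmypdf --sidecar 'txt/{.}.txt' '{}' 'out/{}' ::: *.pdf
```

The preview is handy for checking that every generated command has all three paths (sidecar, input, output) before any OCR work starts.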
The ".pdf" suffixes after the curly brackets (i.e. "{}.pdf") are extraneous and prevent the input files from being found, because "{}" already includes the extension by default. For the text file, putting a period inside the brackets ("{.}") strips the extension, so you end up with "....txt" instead of "....pdf.txt" files (where "..." stands for the same filename as the input).

If the above doesn't work, most likely because of filenames containing whitespace or other characters that interfere with parallel's parsing (quote characters, parentheses, etc.), try using a file as the input instead.
Troubleshooting Solution - Create a File as the Input to parallel
I believe this should work. To avoid the fuss with quotes, I first created a file with the names of all the PDFs, one per line (full relative paths from the current working directory).
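A minimal sketch of building such a list. The filename pdfs.list is hypothetical (any name works), and these are not necessarily the commands originally used. Both variants write one unquoted filename per line.

```shell
# pdfs.list is a hypothetical name for the list file.
# printf writes each glob match on its own line, exactly as-is (no added quotes).
printf '%s\n' *.pdf > pdfs.list

# Alternative using find, which also prints names without quoting;
# -maxdepth 1 restricts it to the current directory.
# find . -maxdepth 1 -name '*.pdf' > pdfs.list
```

Note that if no PDF matches, most shells leave the pattern "*.pdf" unexpanded, so it is worth looking at the resulting file before using it.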
The important thing is that no single quotes are introduced around the filenames printed into the .list file; each name should appear literally. For example, with a hypothetical filename, a line should read ./my file.pdf and not './my file.pdf'.
Then you can run the parallel ocrmypdf command, like so:
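A sketch of that command, assuming the list file is named pdfs.list (a hypothetical name) and that the txt and out directories may need creating. Because the list may contain paths such as "./scan.pdf", this version uses "{/}" (the basename) and "{/.}" (the basename without its extension) for the output names, which is a choice made here rather than something stated above.

```shell
# Assumption: create the target directories in case they do not exist yet.
mkdir -p txt out

# :::: reads the arguments from a file, one per line.
#   {}   -> the line as written in pdfs.list (the input path)
#   {/}  -> basename of that path
#   {/.} -> basename without the extension
parallel --tag -j 2 ocrmypdf --sidecar 'txt/{/.}.txt' '{}' 'out/{/}' :::: pdfs.list
```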
Also notice the four colons ("::::") instead of the usual three, because the arguments are read from a file. By default each line of the file is passed as one complete filename argument, so spaces and similar characters in the PDF filenames in the input file are not a problem.