How do I write a batch process command using gnu parallel?

Question

584 views Asked by SkV At 14 October 2021 at 20:45

I'm trying to do some batch processing using a package called ocrmypdf.

Here is a command that processes a single PDF file:

ocrmypdf input.pdf output.pdf

And here is a command that processes all PDF files in the directory we run it in:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

What I actually want is to run the following command, which takes one more parameter, for every PDF file in the directory:

ocrmypdf --sidecar txt/input.txt input.pdf out/output.pdf

I tried rewriting the earlier parallel command like this:

parallel --tag -j 2 ocrmypdf --sidecar txt/{}.txt {}.pdf out/{}.pdf ::: *.pdf

But I get the error:

ocrmypdf: error: the following arguments are required: output_pdf

Can someone help me understand what I'm doing wrong? Thanks!

There are 2 answers

**John Collins** · Answer 1 · 2021-10-14T22:28:15+00:00

Try:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

The ".pdf" after the curly brackets (i.e. "{}.pdf") is extraneous: "{}" already expands to the full input name, extension included, so "{}.pdf" becomes "name.pdf.pdf" and the input file cannot be found. For the sidecar text file, putting a period inside the brackets ("{.}") removes the extension, so you end up with "name.txt" rather than "name.pdf.txt" (where "name" matches the input's filename).
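To make the two replacement strings concrete, here is a small sketch. The function `show_expansion` is a hypothetical helper, not part of parallel or ocrmypdf; it uses plain bash parameter expansion as a stand-in for `{}` and `{.}`, which should match parallel's behaviour for simple names like the ones `*.pdf` produces. If GNU parallel is installed, the last lines ask it to print the real commands with `--dry-run` instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ocrmypdf command that the corrected template
# would produce for one input file. Bash parameter expansion stands in for
# parallel's replacement strings (an approximation for simple file names).
show_expansion() {
  local f=$1          # plays the role of {}  (full name, extension included)
  local stem=${f%.*}  # plays the role of {.} (last extension removed)
  printf 'ocrmypdf --sidecar txt/%s.txt %s out/%s\n' "$stem" "$f" "$f"
}

show_expansion report.pdf
# The question's template used {}.pdf, which would look for report.pdf.pdf.

# With GNU parallel available, --dry-run prints the commands without running them.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  parallel --dry-run ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: report.pdf
fi
```

Checking the output of `--dry-run` before the real run is a cheap way to catch a doubled extension or a missing argument.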

If the above doesn't work, the likely cause is filenames containing whitespace or other characters that interfere with parallel's parsing (quote characters, parentheses, etc.). In that case, try using a file as the input instead:

Troubleshooting Solution - Create a File as the Input to `parallel`

I believe this should work. To avoid the fuss with quotes, I first created a file with the names of all the pdfs (full relative paths from cwd):

[g]ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list

or

[g]ls --color=none -N *.pdf > ocrmypdf.list

The important thing is that the .list file contains the bare filenames, with no single quotes added around them. It should look like this:

Tritone Substitution sheet music.pdf

not like this:

'Tritone Substitution sheet music.pdf'

Then you can run the parallel ocrmypdf command, like so:

parallel -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} :::: ocrmypdf.list

Also notice the four colons (::::) instead of the usual three (:::), because parallel is now reading its arguments from a file. Each line of the file becomes one full filename argument, so there is no need to worry about spaces etc. in the PDF filenames in the input file.
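Putting the file-based approach together, a minimal sketch might look like the following. It builds the list with the shell's own `printf` rather than `ls`, so no quoting or colour options come into play. The helper `make_pdf_list` and the scratch directory are assumptions for illustration only, and creating `txt/` and `out/` beforehand is a guess at what the real run needs. It also assumes no filename contains a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one PDF name per line, unquoted, to a list file.
# printf is a shell builtin, so ls's quoting and colour options never apply.
make_pdf_list() {
  local dir=$1 list=$2
  ( cd "$dir" && printf '%s\n' *.pdf ) > "$list"
}

# Demo in a scratch directory so nothing real is touched.
work=$(mktemp -d)
touch "$work/Tritone Substitution sheet music.pdf" "$work/plain.pdf"
mkdir -p "$work/txt" "$work/out"   # assumed to be needed before the real run
make_pdf_list "$work" "$work/ocrmypdf.list"
cat "$work/ocrmypdf.list"

# --dry-run only prints the commands; drop it to run ocrmypdf for real.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  ( cd "$work" &&
    parallel --dry-run -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} :::: ocrmypdf.list )
fi
```

In your own directory you would skip the scratch setup and just run `printf '%s\n' *.pdf > ocrmypdf.list` followed by the parallel command.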

**Ole Tange** · Answer 2 · 2021-10-18T06:53:27+00:00

This works for me:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

If it does not work for you:

1. Identify a failing file
2. Run the failing file by hand to check that this works
3. Edit your question to include a link to the failing file
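One possible way to identify a failing file is parallel's `--joblog` option, which records one tab-separated line per job with the exit value in the seventh column and the command in the ninth. The sketch below is only an illustration: `failed_jobs` and the log name `ocr.joblog` are made-up names, and it looks only at the exit value, not at the Signal column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command of every job in a GNU parallel
# joblog whose exit value (column 7) is non-zero. Line 1 is the header.
failed_jobs() {
  awk -F '\t' 'NR > 1 && $7 != 0 { print $9 }' "$1"
}

# Assumed usage: record a joblog during the batch run, then inspect it.
#   parallel --joblog ocr.joblog -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf
#   failed_jobs ocr.joblog
```

Each command it prints can then be pasted into the terminal and run by hand.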

(Also be aware of this bug when running multiple tesseracts: https://github.com/tesseract-ocr/tesseract/issues/3109#issuecomment-703845274)
