ocrmypdf 13.4.1 command line works, but API missing text layers when using output_type="pdf"


I recently upgraded from ocrmypdf 9.0.3/tesseract 4.x to ocrmypdf 13.4.1/tesseract 5.1.

With either ocrmypdf 9.x or 13.x, this works on the CLI:

$ ocrmypdf --output-type pdf sample-file.pdf output-file.pdf

However, when I use the API within my app,

import ocrmypdf

ocrmypdf.ocr("path/to/inputfile.pdf", "path/to/outputfile.pdf", output_type="pdf")

the text layers are added only with ocrmypdf 9.x; with 13.4.1, no text is searchable.

However, if I use:

ocrmypdf.ocr("inputfile.pdf", "outputfile.pdf", output_type="pdfa")

then the appropriate text layers are set with either 9.x or 13.4.1.

I feel like I'm missing something very basic... any help here?


1 answer

William

This turned out to be a non-issue.

A post-processing step that ran after ocrmypdf was changing the output.
ocrmypdf 13.4.x itself works fine.
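
For anyone who hits the same symptom, one way to tell whether ocrmypdf or a later step is responsible is to check each intermediate file for extractable text. Below is a minimal sketch, assuming pdfminer.six is installed. The helper names (`extract_with_pdfminer`, `has_text_layer`, `first_stage_without_text`) are hypothetical and not part of the ocrmypdf API.

```python
from typing import Callable, Iterable, Optional, Tuple


def extract_with_pdfminer(pdf_path: str) -> str:
    """Return all text found in the PDF, using pdfminer.six.

    The import is done lazily so pdfminer.six is only required
    when this default extractor is actually called.
    """
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def has_text_layer(
    pdf_path: str,
    extractor: Callable[[str], str] = extract_with_pdfminer,
) -> bool:
    """True if the extractor finds any non-whitespace text in the file."""
    return bool(extractor(pdf_path).strip())


def first_stage_without_text(
    stages: Iterable[Tuple[str, str]],
    extractor: Callable[[str], str] = extract_with_pdfminer,
) -> Optional[str]:
    """Given (label, path) pairs in pipeline order, return the label of the
    first file with no extractable text, or None if every file has text."""
    for label, path in stages:
        if not has_text_layer(path, extractor):
            return label
    return None
```

As a usage example, suppose the ocrmypdf.ocr(...) call writes a file named after_ocr.pdf and the post-processing step then writes final.pdf (both file names are placeholders). Calling first_stage_without_text([("ocr", "after_ocr.pdf"), ("post-process", "final.pdf")]) would return "post-process" if that later step is the one dropping the text. The extractor is passed as a parameter so a different text-extraction tool can be swapped in.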