I recently upgraded from ocrmypdf 9.0.3/tesseract 4.x to ocrmypdf 13.4.1/tesseract 5.1.
With either ocrmypdf 9.x or 13.x, this works on the CLI:
$ ocrmypdf --output-type pdf sample-file.pdf output-file.pdf
However, when I use the API within my app:
import ocrmypdf
ocrmypdf.ocr("path/to/inputfile.pdf", "path/to/outputfile.pdf", output_type="pdf")
The text layer is added only when I use ocrmypdf 9.x; with 13.4.1, no text in the output is searchable.
However, if I use:
ocrmypdf.ocr("inputfile.pdf", "outputfile.pdf", output_type="pdfa")
then the text layer is added correctly with both 9.x and 13.4.1.
I feel like I'm missing something very basic... any help here?
This turned out to be a non-issue.
There was a post-processing step involved that subsequently changed the output.
13.4.x works fine.
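For anyone who hits something similar: one way to find out which step loses the text layer is to check each intermediate file for extractable text, in pipeline order. Below is a minimal sketch, assuming pdfminer.six is available for text extraction. The helper names (text_is_searchable, extract_pdf_text, first_file_without_text, ocr_then_check) and all file paths are hypothetical and not part of ocrmypdf's API; only ocrmypdf.ocr() and pdfminer's extract_text() are real library calls.

```python
from typing import Callable, Iterable, Optional


def text_is_searchable(text: str) -> bool:
    """True if the extracted text contains anything besides whitespace."""
    return bool(text.strip())


def extract_pdf_text(pdf_path: str) -> str:
    """Extract text with pdfminer.six (assumed installed); imported lazily."""
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def first_file_without_text(
    pdf_paths: Iterable[str],
    extract: Callable[[str], str] = extract_pdf_text,
) -> Optional[str]:
    """Return the first file, in the order given, with no extractable text."""
    for path in pdf_paths:
        if not text_is_searchable(extract(path)):
            return path
    return None


def ocr_then_check(input_pdf: str, ocr_output: str) -> bool:
    """Run OCR exactly as in the question, then check the raw output."""
    import ocrmypdf  # imported lazily so the helpers above work without it

    ocrmypdf.ocr(input_pdf, ocr_output, output_type="pdf")
    return text_is_searchable(extract_pdf_text(ocr_output))
```

If ocr_then_check("inputfile.pdf", "ocr-output.pdf") returns True but the final file has no text, then passing the intermediate files in order, for example first_file_without_text(["ocr-output.pdf", "post-processed.pdf"]), should point at the step that changed the output. In my case that was the post-processing step, not ocrmypdf.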