Is there a way to force Tesseract to do OCR only and leave the original images intact? At the moment, I use the command:
tesseract -l eng file.tif file pdf
in order to produce file.pdf
from a multipage TIFF file. My problem with this command is that Tesseract modifies the images: for example, thin lines that outline tables or parts of figures are removed. I'd like to stop this behavior and have Tesseract only add the OCR text, underlaid on the unmodified original image. In case it matters,
$ tesseract -v
tesseract 3.03
leptonica-1.71
libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0
and
$ cat /usr/share/tessdata/configs/pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 1
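For reference, the trailing `pdf` in the command names the config file listed above, and the output base name is given without an extension. A minimal sketch of wrapping that same invocation for several TIFF files might look like the following. `ocr_cmd` is a hypothetical helper name, and the script only prints the commands it would run, so it assumes nothing beyond the flags already used in the question:

```shell
#!/bin/sh
# Hypothetical helper: print the tesseract command for one input TIFF.
# The output base is the input path with its last extension stripped, and
# the final argument names the config file (here "pdf").
ocr_cmd() {
    input=$1
    base=${input%.*}
    printf 'tesseract -l eng %s %s pdf\n' "$input" "$base"
}

# Dry run: show the command for each file given on the command line.
for f in "$@"; do
    ocr_cmd "$f"
done
```

The printed lines are not quoted, so paths containing spaces would need extra care before being executed.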
Using the current git version of Tesseract, the resulting images look much better.
Basically, all of the lines in tables and figures that 3.03 used to eliminate now remain. That said, the image is still manipulated, and its resolution is lower than that of the original. Nevertheless, for my purposes, things look OK.
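One way to put numbers on the remaining resolution difference is to compare the images embedded in the PDF with the source TIFF. The sketch below assumes the poppler tool `pdfimages` and the libtiff tool `tiffinfo` are installed, neither of which is mentioned above. It also assumes the column layout of recent poppler releases, where `pdfimages -list` prints two header lines and reports x-ppi and y-ppi in fields 13 and 14. `summarize_images` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: reduce `pdfimages -list` output (read from stdin)
# to "page width height x-ppi y-ppi" for each embedded image.
# Assumption: the first two lines are a header and a dashed rule, and the
# x-ppi / y-ppi values are whitespace-separated fields 13 and 14.
summarize_images() {
    awk 'NR > 2 { print $1, $4, $5, $13, $14 }'
}

# Usage (requires poppler-utils and the libtiff tools):
#   pdfimages -list file.pdf | summarize_images
#   tiffinfo file.tif | grep -E 'Image Width|Resolution'
```

If the pixel dimensions reported for the PDF are smaller than those `tiffinfo` reports for the TIFF, the images were downsampled rather than merely re-encoded.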