How to ignore scanned image in tika

Question

Asked by pramesh on 09 September 2020 at 15:24

I'm trying to parse PDF files with Tika. For some handwritten, scanned documents, Tika parses the file and returns garbage text that does not make sense. I'm using the Python tika wrapper. Is there some way to ignore PDFs that contain images? The Tesseract OCR parser is turned off; it does not appear in the metadata after parsing a file.

There is 1 answer

**marek.kapowicki** · Answer 1 · 2020-09-23T15:52:00+00:00

To ignore inline images, send the request header "X-Tika-PDFextractInlineImages: false" to the Tika server or, in the Java API, call:

pdfParserConfig.setExtractInlineImages(false)
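Since the question uses the Python wrapper, here is a minimal sketch of passing the same flag as a request header. It assumes the wrapper's `parser.from_file` forwards a `headers` dict to the Tika server; the helper names `inline_image_headers` and `parse_pdf_without_inline_images` are made up for illustration.

```python
def inline_image_headers(extract: bool) -> dict:
    """Build the Tika server header that toggles inline-image extraction."""
    # The header value is sent as the string "true" or "false".
    return {"X-Tika-PDFextractInlineImages": "true" if extract else "false"}


def parse_pdf_without_inline_images(path: str) -> str:
    """Parse a PDF with inline-image extraction switched off.

    Needs the tika package and a reachable Tika server.
    """
    # Imported lazily so the header helper works without tika installed.
    from tika import parser

    parsed = parser.from_file(path, headers=inline_image_headers(False))
    # "content" can be None when Tika finds no text at all.
    return (parsed.get("content") or "").strip()
```

Calling `parse_pdf_without_inline_images("report.pdf")` (the file name is a placeholder) should then return only the text layer of the PDF.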

But to be honest, setting the value to false only makes sense for a "native" PDF, that is, one with a real text layer.

For scanned documents this flag has to be set to true. Then the only way to improve the result is to turn on OCR and use the OcrStrategy OCR_ONLY.
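The two cases above could be sketched with the Python wrapper roughly like this. It assumes the Tika server accepts the `X-Tika-PDFOcrStrategy` header with the value `ocr_only`, and that Tesseract is installed and enabled on the server (the question has it turned off, so that would need to change). The names `headers_for_pdf` and `parse_pdf`, and the `scanned` flag that the caller must supply, are hypothetical.

```python
def headers_for_pdf(scanned: bool) -> dict:
    """Pick Tika server headers for a scanned or a native PDF."""
    if scanned:
        # Scanned document: keep inline images on and let OCR produce the text.
        return {
            "X-Tika-PDFextractInlineImages": "true",
            "X-Tika-PDFOcrStrategy": "ocr_only",
        }
    # Native PDF: the text layer is enough, so skip inline images.
    return {"X-Tika-PDFextractInlineImages": "false"}


def parse_pdf(path: str, scanned: bool) -> str:
    """Parse a PDF with headers chosen by document type.

    Needs the tika package and a reachable Tika server.
    """
    # Imported lazily so the header helper works without tika installed.
    from tika import parser

    parsed = parser.from_file(path, headers=headers_for_pdf(scanned))
    # "content" can be None when Tika finds no text at all.
    return (parsed.get("content") or "").strip()
```

How to decide whether a given file is scanned is left to the caller here; the answer does not describe a way to detect it.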
