Issue using Apache tika parser when trying to parse pdf having text contains image

1k views Asked by At

I am using these two dependencies:- tika core 2.6.0 tika parser standard package 2.6.0 .Parsing is working fine for these cases:- pdf file with text. pdf file with images. text files and other extensions.

Parsing is failing with pdfparser runtime exception for the use case below:- pdf file with text inside images.

Can someone pls suggest how to resolve failed case here. Thanks

Full error Stack trace:-

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2d539b25 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] Caused by: java.lang.NullPointerException at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:520) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:786) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:154) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27] at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1277) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27] at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:198) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] ... 37 more

1

There are 1 answers

4
marek.kapowicki On

You should use different PDFParserConfig There are 2 types of pdfs files

  1. native files (also called searchble) - Tika is able to extract text from native without ocr

    PDFParserConfig pdfParserConfig = new PDFParserConfig();

    pdfParserConfig.setExtractInlineImages(false);

    pdfParserConfig.setOcrStrategy(NO_OCR);

  2. scanned files (or images converted to pdf) - Tika has to do the OCR (using the tesseract under the hood)

    PDFParserConfig pdfParserConfig = new PDFParserConfig();

    pdfParserConfig.setExtractInlineImages(true);

    pdfParserConfig.setOcrStrategy(OCR_ONLY);