PDFBox PDFMergerUtility: how do I tell which sources failed?


I'm merging PDFs like this:

PDFMergerUtility mergePdf = new PDFMergerUtility();

for (int i = 0; i < filePaths.size(); i++) 
    mergePdf.addSource(filePaths.get(i));

mergePdf.setDestinationFileName(tempFile.getAbsolutePath()); 
mergePdf.mergeDocuments();

This works well until an exception is thrown on a PDF that can't be parsed (either a corrupt PDF or something PDFBox can't handle). That doesn't happen very often.

I would like to be able to tell which source(s) it failed on, exclude them in a subsequent merge and tell the user which documents failed.

Can this be done?

UPDATE:

Here's my exception:

java.io.IOException: Error: Expected a long type at offset 591535, instead got '[binary data omitted]'
    at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1695)
    at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1623)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:614)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1220)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:237)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:194)
    at myapp.util.DocumentImage.combinePDFs(DocumentImage.java:289)
    at myapp.webapp.download.DownloadLatestForCLO.generate(DownloadLatestForCLO.java:73)
    at myapp.webapp.download.DownloadLatestForCLO.getFileSize(DownloadLatestForCLO.java:64)
    at myapp.webapp.download.DownloadServlet.handleRequest(DownloadServlet.java:58)
    at myapp.webapp.download.DownloadServlet.doGet(DownloadServlet.java:32)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

There is 1 answer

PaulG

Luckily, PDFBox is open source. In the latest source (2.0.0 RC3 at the time of writing), in the file \pdfbox-2.0.0-RC3\pdfbox\src\main\java\org\apache\pdfbox\multipdf\PDFMergerUtility.java (around line 188), we can see that the exception is propagated up from a lower level: the merger neither catches it nor adds any detail about which file caused the error.

Until this is fixed, you will have to catch the exception in your own code, then iterate over the source files, loading and closing each one, to find the one(s) that cannot be processed and report them yourself.
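A minimal sketch of that workaround follows. The class `SourceChecker` and its `Probe` callback are hypothetical names introduced here for illustration, not part of PDFBox. The callback stands in for "open and close one source", so the same loop works whichever PDFBox version you are on.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits source paths into those that
// load cleanly and those that throw, so the bad ones can be reported and skipped.
class SourceChecker {

    /** Hypothetical callback: opens and closes one source, throwing if it cannot be parsed. */
    interface Probe {
        void check(String path) throws IOException;
    }

    final List<String> good = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    SourceChecker(List<String> paths, Probe probe) {
        for (String path : paths) {
            try {
                probe.check(path);   // e.g. load the PDF and close it again
                good.add(path);
            } catch (IOException e) {
                failed.add(path);    // remember the source that could not be parsed
            }
        }
    }
}
```

With PDFBox 2.0.x the probe could be `path -> PDDocument.load(new File(path)).close()`; on 1.8.x, `PDDocument.load(path).close()` should do the same job. You would then pass each entry of `good` to `addSource` for the second merge and show `failed` to the user. Since this parses every file an extra time, it is probably best to run it only after `mergeDocuments()` has thrown.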

If you are interested in fixing the problem at its source (inside PDFBox), this is the edit to make and submit to the PDFBox project team. Once the fix is incorporated into a build and you upgrade to that version, you can remove your iteration code:

        try
        {
            MemoryUsageSetting partitionedMemSetting = memUsageSetting != null ? 
                    memUsageSetting.getPartitionedCopy(sources.size()+1) :
                    MemoryUsageSetting.setupMainMemoryOnly();
            Iterator<InputStream> sit = sources.iterator();
            destination = new PDDocument(partitionedMemSetting);

            while (sit.hasNext())
            {
                sourceFile = sit.next();
                source = PDDocument.load(sourceFile, partitionedMemSetting);
                tobeclosed.add(source);
                appendDocument(destination, source);
            }
            if (destinationStream == null)
            {
                destination.save(destinationFileName);
            }
            else
            {
                destination.save(destinationStream);
            }
        }

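        catch (IOException e)
        {
            // Suggested change: wrap the low-level exception so the message names the
            // failing source. In this excerpt sourceFile comes from an Iterator<InputStream>,
            // so a readable file name would have to be tracked alongside each stream.
            throw new IOException("Error merging source: " + sourceFile, e);
        }
        finally
        {
            ....
        }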
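One possible shape for the exception thrown from the catch block above is a small `IOException` subclass that carries the source name. `SourceLoadException` and `getSourceName()` are hypothetical names used only for this sketch and do not exist in PDFBox.

```java
import java.io.IOException;

// Hypothetical exception type, not part of PDFBox: records which source failed
// and keeps the original parser exception as its cause.
class SourceLoadException extends IOException {
    private final String sourceName;

    SourceLoadException(String sourceName, IOException cause) {
        super("Could not load source '" + sourceName + "': " + cause.getMessage(), cause);
        this.sourceName = sourceName;
    }

    /** Name of the source that could not be loaded. */
    String getSourceName() {
        return sourceName;
    }
}
```

Because it extends `IOException`, existing callers that catch `IOException` would keep working, while new callers could catch the subclass and read `getSourceName()` to exclude that file and retry the merge.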