PDFBox: extracting images from pdf to inputstream


I am using PDFBox to extract the images from my pdf (which contains only jpg's).

Since I will save those images in my database, I would like to convert each image directly to an InputStream object, without temporarily placing a file on my file system. I am having difficulties with this, however. I think it has to do with my use of image.getPDStream().createInputStream(), as in the following example:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().createInputStream(); //this gives me a corrupt file
    byte[] buffer = new byte[1024];
    while (is.read(buffer) > 0) {
        output.write(buffer);
    }
}

However this works:

while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map<String, PDXObject> images = resources.getXObjects();
    if (images != null) {
        Iterator<?> imageIter = images.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            image.write2file(new File("C:\\Users\\Anton\\Documents\\lol\\test.jpg")); //this works however
        }
    }
}

Any idea how I can convert each PDXObjectImage (or any other object I can get) to an inputstream?


There are 3 answers

Tilman Hausherr On BEST ANSWER

In PDFBox 1.8, the easiest way is to use write2OutputStream(), so your first code block would now look like this:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    image.write2OutputStream(output);
    output.close(); // flush and release the file handle
}

Advanced solution, as long as you're really sure you have only JPEGs that display properly, i.e. that have no unusual colorspace:

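// DCT_FILTERS is not defined in this snippet; presumably a List<String> of the
// filter names at which decoding stops, e.g. "DCTDecode" and its abbreviation "DCT".
while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
    byte[] buffer = new byte[1024];
    int n;
    while ((n = is.read(buffer)) > 0) {
        output.write(buffer, 0, n); // write only the bytes actually read
    }
    is.close();
    output.close();
}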

The second solution removes all filters except the DCT (= JPEG) filter, so the bytes you get are the original JPEG data. Some older PDFs have several filters, e.g. ASCII85 and DCT.
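A note on the copy loop in snippets like these: read() may fill only part of the buffer, so writing the whole buffer on every pass appends stale bytes to the output (with a 1024-byte buffer, the last chunk is almost always short). A minimal sketch of a safe copy using only java.io; StreamCopy and copy are hypothetical names, not PDFBox API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of PDFBox.
final class StreamCopy {
    private StreamCopy() {}

    /** Copies all bytes from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        // read() returns the number of bytes placed in the buffer, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the n bytes that were just read
            total += n;
        }
        return total;
    }
}
```

The caller stays responsible for closing both streams.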

Now even if you created the image with JPEGs, you don't know what your PDF creation software did. One way to find out what type of image it is, is to check what class it is (use instanceof):

- PDPixelMap => PNG
- PDJpeg => JPEG
- PDCcitt => TIF

Another way is to use image.getSuffix().
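Independently of PDFBox, you can also sanity-check the bytes you extracted by looking at their leading "magic" bytes before storing them. This is a different technique from the instanceof / getSuffix() checks above, and only a rough sketch: ImageSniffer and sniff are hypothetical names, and the returned strings are my own choice, not necessarily what getSuffix() returns.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class ImageSniffer {
    private ImageSniffer() {}

    /** Returns "jpg", "png", "tiff", or "unknown" based on the leading bytes. */
    static String sniff(byte[] data) {
        // JPEG files start with the SOI marker FF D8, followed by another FF marker byte
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "jpg";
        // PNG signature: 89 'P' 'N' 'G' CR LF 1A LF
        if (startsWith(data, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
        // TIFF: "II" + 42 (little endian) or "MM" + 42 (big endian)
        if (startsWith(data, 'I', 'I', 42, 0) || startsWith(data, 'M', 'M', 0, 42)) return "tiff";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A stream that was fully decoded (raw pixel data instead of JPEG data) will typically come back as "unknown" here, which is a quick way to spot the "corrupt file" case from the question.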

Gregor Lah On

PDXObjectImage has a method write2OutputStream(OutputStream out). If you pass it a ByteArrayOutputStream, you can then get a byte array out of that output stream.

See "How to convert OutputStream to InputStream?" for ways to turn an OutputStream into an InputStream.
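To make that concrete, here is a minimal sketch of the in-memory route: write into a ByteArrayOutputStream, then wrap the resulting bytes in a ByteArrayInputStream. It uses only java.io; InMemoryStreams, StreamWriter and toInputStream are hypothetical names introduced for illustration, not PDFBox API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of PDFBox.
final class InMemoryStreams {
    private InMemoryStreams() {}

    /** Anything that can write itself to an OutputStream (hypothetical interface). */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Lets the writer fill an in-memory buffer, then exposes the bytes as an InputStream. */
    static InputStream toInputStream(StreamWriter writer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writer.writeTo(buffer);
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

With PDFBox 1.8 and Java 8 or later, the call inside the image loop would look something like InputStream is = InMemoryStreams.toInputStream(out -> image.write2OutputStream(out)); no temporary file is involved. The whole image is held in memory, which should be fine for typical embedded JPEGs but is worth keeping in mind for very large ones.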

Carles Xavier On

If you are using PDFBox 2.0.0 or above, you can render each page to an image and wrap it in an InputStream:

PDDocument document = PDDocument.load(new File("filePath")); //filePath is the path to your .pdf
PDFRenderer pdfRenderer = new PDFRenderer(document);

for(int i=0; i<document.getPages().getCount(); i++){
    BufferedImage bim = pdfRenderer.renderImage(i, 1.0f, ImageType.RGB); //Get bufferedImage for page "i" with scale 1
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ImageIO.write(bim, "jpg", os);
    InputStream is = new ByteArrayInputStream(os.toByteArray());
    //Do whatever you need with the inputstream
}
document.close();