PDFBox: extracting images from pdf to inputstream

Question

Asked by user3125591 On 07 June 2015 at 11:39

I am using PDFBox to extract the images from my pdf (which contains only jpg's).

Since I will save those images in my database, I would like to convert each image directly to an InputStream without writing a temporary file to my file system. I am having difficulties with this, however. I think it has to do with my use of image.getPDStream().createInputStream(), as in the following example:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().createInputStream(); //this gives me a corrupt file
    byte[] buffer = new byte[1024];
    while (is.read(buffer) > 0) {
        output.write(buffer);
    }
}

However this works:

while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map<String, PDXObject> images = resources.getXObjects();
    if (images != null) {
        Iterator<?> imageIter = images.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            image.write2file(new File("C:\\Users\\Anton\\Documents\\lol\\test.jpg")); //this works however
        }
    }
}

Any idea how I can convert each PDXObjectImage (or any other object I can get) to an inputstream?

There are 3 answers

Gregor Lah On 07 June 2015 at 12:39

PDXObjectImage has a method write2OutputStream(OutputStream out). If you pass it an in-memory output stream, you can then get the image as a byte array from that stream.

See "How to convert OutputStream to InputStream?" for turning the OutputStream into an InputStream.
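
As a rough sketch of that conversion using only the JDK: buffer the output in a ByteArrayOutputStream, then wrap the resulting bytes in a ByteArrayInputStream. The class name InMemoryImageStream and the StreamWriter interface are made up for illustration; with PDFBox 1.8 the writer would presumably be image::write2OutputStream. The whole image is held in memory, which should be fine for typical JPEGs but is worth keeping in mind for very large ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class InMemoryImageStream {

    /**
     * Hypothetical stand-in for anything that writes to an OutputStream,
     * e.g. image::write2OutputStream in PDFBox 1.8.
     */
    public interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Buffers everything the writer produces in memory and exposes it as an InputStream. */
    public static InputStream toInputStream(StreamWriter writer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writer.writeTo(buffer);
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for real image data.
        byte[] fake = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xD9};
        InputStream is = toInputStream(out -> out.write(fake));
        System.out.println("bytes available: " + is.available());
    }
}
```

The returned InputStream can be handed to whatever database API expects one, with no temporary file involved.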

Carles Xavier On 09 June 2016 at 17:10

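If you are using PDFBox 2.0.0 or above, you can render each page to a BufferedImage (note that this renders whole pages rather than extracting the embedded images):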

PDDocument document = PDDocument.load(new File("filePath")); //filePath is the path to your .pdf
PDFRenderer pdfRenderer = new PDFRenderer(document);

for(int i=0; i<document.getPages().getCount(); i++){
    BufferedImage bim = pdfRenderer.renderImage(i, 1.0f, ImageType.RGB); //Get bufferedImage for page "i" with scale 1
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ImageIO.write(bim, "jpg", os);
    InputStream is = new ByteArrayInputStream(os.toByteArray());
    //Do whatever you need with the inputstream
}
document.close();

Tilman Hausherr On 07 June 2015 at 12:32 (accepted answer)

In PDFBox 1.8, the easiest way is to use write2OutputStream(), so your first code block would now look like this:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    image.write2OutputStream(output);
    output.close(); // release the file handle
}

A more advanced solution, as long as you're really sure you have only JPEGs that display properly, i.e. that have no unusual colorspace:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
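    // DCT_FILTERS is not defined in this snippet; it is assumed to be a
    // List<String> holding the DCT filter names ("DCTDecode" and "DCT")
    InputStream is = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
    byte[] buffer = new byte[1024];
    int n;
    while ((n = is.read(buffer)) != -1) {
        output.write(buffer, 0, n); // write only the bytes actually read
    }
    output.close();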
}

The second solution removes all filters except the DCT (= JPEG) filter. Some older PDFs have several filters, e.g. ASCII85 and DCT.
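
A side note on the copy loop itself: calling output.write(buffer) writes all 1024 bytes even when read() returned fewer, which pads the result with stale bytes from the previous iteration. A minimal JDK-only sketch of a loop that honors the returned count follows; the class and method names are illustrative, not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {

    /** Copies all bytes from in to out, writing only the bytes actually read. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        // read() returns -1 at end of stream, otherwise the number of bytes placed in buffer
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[1500]; // deliberately not a multiple of the buffer size
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), sink);
        System.out.println(copied + " bytes copied, sink holds " + sink.size());
    }
}
```

This does not change which filters are applied to the stream; it only keeps the copied data byte-for-byte identical to what the stream delivers.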

Even if you created the PDF from JPEGs, you can't be sure what your PDF creation software did with them. One way to find out what type an image is, is to check its class (use instanceof):

- PDPixelMap => PNG
- PDJpeg => JPEG
- PDCcitt => TIF

Another way is to use image.getSuffix().
