I am using PDFBox to extract the images from my pdf (which contains only jpg's).
Since I will save those images inside my database, I would like to directly convert each image to an inputstream object first without placing the file temporary on my file sysem. I am facing difficulties with this however. I think it has to do because of the use of image.getPDFStream().createInputStream()
as I did in the following example:
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) images.get(key);
FileOutputStream output = new FileOutputStream(new File(
"C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
InputStream is = image.getPDStream().createInputStream(); //this gives me a corrupt file
byte[] buffer = new byte[1024];
while (is.read(buffer) > 0) {
output.write(buffer);
}
}
However this works:
while (iter.hasNext()) {
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map<String, PDXObject> images = resources.getXObjects();
if (images != null) {
Iterator<?> imageIter = images.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) images.get(key);
image.write2file(new File("C:\\Users\\Anton\\Documents\\lol\\test.jpg")); //this works however
}
}
}
Any idea how I can convert each PDXObjectImage (or any other object I can get) to an inputstream?
In PDFBox 1.8, the easiest way is to use write2OutputStream(), so your first code block would now look like this:
advanced solution, as long as you're really sure you have only JPEGs that display properly, i.e. have no unusual colorspace:
The second solution removes all filters except the DCT (= JPEG) filter. Some older PDFs have several filters, e.g. ascii85 and DCT.
Now even if you created the image with JPEGs, you don't know what your PDF creation software did. One way to find out what type of image it is, is to check what class it is (use instanceof):
Another way is to use image.getSuffix().