I'm trying to use PurePDF to gather some information from a PDF file, but I can't get PurePDF to read it.
Whenever PurePDF tries to read any PDF, it reports that it can't find the header. I tried debugging it and noticed that the strings read from the ByteArray come out as Japanese characters! I tried changing the endianness of the PDF's ByteArray before passing it to PurePDF, but that didn't change anything.
The PDF file is fine, since I can see the "%PDF-" header whenever I open it as text, but for some reason ActionScript is getting the wrong char codes, so PurePDF can't work at all.
Any ideas?
Thanks.
Update: I'm not a ByteArray specialist, but I decided to follow the code execution through the debugger and found that it was using readInt() to get the characters. I rewrote it to use readByte() and now it reads the PDF! I have yet to see whether the features work... Can anyone who is more into low-level programming explain what might be happening? I don't think the project is broken in the SVN repository.
This is the code I have been using; I think it is quite straightforward:
private function loadPdf():void
{
    var loader:URLLoader = new URLLoader();
    loader.dataFormat = URLLoaderDataFormat.BINARY;
    loader.addEventListener(Event.COMPLETE, onLoadComplete);
    loader.load(new URLRequest(PDF_FILE));
}

protected function onLoadComplete(event:Event):void
{
    var data:ByteArray = URLLoader(event.target).data as ByteArray;
    pdfReader = new PdfReader(data);
    pdfReader.readPdf();
}
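A quick way to rule out the loading step is to check the header directly on the loaded ByteArray before it reaches PdfReader. The following is only a minimal sketch: `startsWithPdfHeader` is a hypothetical helper name, not part of PurePDF, and it assumes the file begins with the header at offset 0.

```actionscript
import flash.utils.ByteArray;

// Hypothetical helper (not part of PurePDF): true if the data begins with "%PDF-".
function startsWithPdfHeader(data:ByteArray):Boolean
{
    if (data == null || data.length < 5)
    {
        return false;
    }
    var saved:uint = data.position;            // remember the caller's position
    data.position = 0;
    var header:String = data.readUTFBytes(5);  // five ASCII bytes, one character each
    data.position = saved;                     // leave the ByteArray as it was
    return header == "%PDF-";
}
```

Calling `trace(startsWithPdfHeader(data))` at the top of onLoadComplete would show whether the bytes arrive intact. If it reports true, the download is fine and the problem lies in how the bytes are read afterwards.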
I haven't worked with PurePDF before, but I have used ByteArray to extract information from files. What exactly do you want to get from this PDF? Do you want to extract just the text? Also, can you upload a link to the PDF? It will be easier to help if we are looking at the same thing.
About the Japanese text... When you read the PDF into a ByteArray, don't expect to easily find human-readable text, because most of that data sets up the file structure. The actual text and pictures from the PDF are stored inside objects called streams. So you usually locate a stream of text and extract it into your ByteArray. To display the text correctly, you then decode it using the encoding indicated in the PDF data (UTF-8, UTF-16, etc.).
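Going by the readInt() detail in the question's update, another possible source of the odd characters is that four header bytes were packed into a single 32-bit value and then treated as one character code. The sketch below only illustrates that arithmetic. It assumes a big-endian ByteArray (the default) and assumes that whatever turned the value into a character kept only its low 16 bits; how PurePDF actually builds its strings is not something this example confirms.

```actionscript
import flash.utils.ByteArray;
import flash.utils.Endian;

var bytes:ByteArray = new ByteArray();   // big-endian by default
bytes.writeUTFBytes("%PDF-1.4");

// Reading four bytes at once packs '%', 'P', 'D', 'F' into one value.
bytes.position = 0;
var packed:int = bytes.readInt();               // 0x25504446
var lowBig:int = packed & 0xFFFF;               // 0x4446

// Switching the byte order changes the value but not the kind of result.
bytes.endian = Endian.LITTLE_ENDIAN;
bytes.position = 0;
var lowLittle:int = bytes.readInt() & 0xFFFF;   // 0x5025

// Reading one byte at a time gives the expected ASCII codes.
bytes.position = 0;
var header:String = "";
for (var i:int = 0; i < 5; i++)
{
    header += String.fromCharCode(bytes.readUnsignedByte());
}
trace(lowBig.toString(16), lowLittle.toString(16), header);
```

Both 0x4446 and 0x5025 fall in the Unicode CJK ideograph ranges, which would be consistent with seeing "Japanese" characters and with the endianness change making no visible difference.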
The link below explains PDF streams in more detail ("/Length" gives the stream's length in bytes, which becomes your ByteArray length, and "/Filter" tells you how the stream data is encoded, e.g. ASCIIHexDecode or FlateDecode):
http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
Anyway, all of this makes more sense if you open your PDF in a hex editor. Try the one below if you need one. You can then see where your streams are positioned and tell AS3 to extract from there:
http://www.hhdsoftware.com/free-hex-editor
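The same offsets you see in the hex editor can also be found from AS3 by scanning the ByteArray for the ASCII keywords. This is a rough sketch only: `indexOfAscii` and `traceFirstStream` are hypothetical helper names, and the search is a naive byte comparison that would also match the word "stream" if it happened to appear elsewhere in the file.

```actionscript
import flash.utils.ByteArray;

// Hypothetical helper: offset of the first occurrence of an ASCII keyword
// at or after `from`, or -1 if it does not occur.
function indexOfAscii(data:ByteArray, keyword:String, from:uint = 0):int
{
    var n:uint = keyword.length;
    for (var i:uint = from; i + n <= data.length; i++)
    {
        var j:uint = 0;
        while (j < n && data[i + j] == keyword.charCodeAt(j))
        {
            j++;
        }
        if (j == n)
        {
            return int(i);
        }
    }
    return -1;
}

// Usage: report where the first stream's keywords sit in the file.
function traceFirstStream(data:ByteArray):void
{
    var start:int = indexOfAscii(data, "stream");
    if (start < 0)
    {
        trace("no stream keyword found");
        return;
    }
    var end:int = indexOfAscii(data, "endstream", start + 6);
    trace("stream keyword at", start, "endstream at", end);
}
```

The stream data itself begins after the end-of-line marker that follows the "stream" keyword. If the stream's /Filter is /FlateDecode, those bytes are zlib-compressed and would need to be decompressed before any text becomes readable.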
If there's still a problem, upload your PDF somewhere and say exactly what you're trying to extract from the document. I will try to give exact help for that (no promises, just trying to help). Peace.