Reading a PDF, character problems

I'm trying to use PurePDF to gather some information from a PDF file, but I can't get PurePDF to read it.

Whenever PurePDF tries to read any PDF, it says it can't find the header. I tried debugging it and noticed that the strings read from the ByteArray come out as Japanese characters! I have tried changing the endianness of my PDF's ByteArray before passing it to PurePDF, but that didn't change anything.

The PDF file is OK, as I can see the "%PDF-" header whenever I open it as text, but for some reason ActionScript is getting the wrong character codes, so PurePDF just can't work at all.

Any ideas?

Thanks.


Update: I'm not a ByteArray specialist, but I decided to tough it out and step through the code in the debugger. I found that it was using readInt() to get the characters; I rewrote that to readByte() and now it reads the PDF! I have yet to see whether the features will work. Can anyone who is more into low-level programming explain to me what might be happening? I don't think the project is broken in SVN.

This is the code I have been using. I think it is quite straightforward:

// Load the PDF as raw binary data.
private function loadPdf():void
{
    var loader:URLLoader = new URLLoader();
    loader.dataFormat = URLLoaderDataFormat.BINARY;
    loader.addEventListener(Event.COMPLETE, onLoadComplete);
    loader.load(new URLRequest(PDF_FILE));
}

// Hand the loaded bytes to PurePDF once loading completes.
protected function onLoadComplete(event:Event):void
{
    var data:ByteArray = URLLoader(event.target).data as ByteArray;
    pdfReader = new PdfReader(data);
    pdfReader.readPdf();
}

There is 1 answer

VC.One On

I haven't worked with PurePDF before, but I have used ByteArray to extract information from files. What exactly do you want to get from this PDF? Do you want to extract just the text? Also, can you upload a link to the PDF? It will be easier to help if we are looking at the same thing.

About the Japanese text: when you read the PDF into a ByteArray, don't expect to easily find human-readable text, because most of that data is there to set up the file structure. The actual text and pictures in the PDF are placed inside objects called streams. So you usually find a stream of text and extract that into your ByteArray. To display the text correctly, you then decode it with the encoding (UTF-8, UTF-16, etc.) indicated in the PDF data.
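On the readInt() versus readByte() point from the question's update, here is a small sketch of what is probably going on. It is written in Python rather than AS3 so it can be run anywhere, and it rests on one assumption: that the character conversion keeps only the low 16 bits of the number it is given. readInt() consumes four bytes at once, so the four ASCII bytes of "%PDF" collapse into a single large number, and its low 16 bits land in the CJK range whichever byte order is used. That would also explain why changing the endianness made no difference.

```python
import struct

header = b"%PDF-1.4"  # the first bytes of a typical PDF file

# One byte per character, which is what readByte() gives you.
per_byte = "".join(chr(b) for b in header[:4])

# Four bytes as one signed 32-bit integer, which is what readInt() gives you.
# AS3's ByteArray defaults to big-endian; little-endian is shown for comparison.
(big,) = struct.unpack(">i", header[:4])
(little,) = struct.unpack("<i", header[:4])

# Assumption: turning the integer into a character keeps only the low 16 bits.
big_char = chr(big & 0xFFFF)
little_char = chr(little & 0xFFFF)

print(per_byte)
print(hex(big), hex(ord(big_char)))
print(hex(little), hex(ord(little_char)))
```

Both resulting code points fall inside CJK ideograph blocks, so a header read this way can never match "%PDF-".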

The link below explains PDF streams in more detail ("/Length" gives the number of bytes to read into your ByteArray, and "/Filter" names the decoding to apply to them, e.g. FlateDecode for zlib-compressed data or ASCIIHexDecode for hex-encoded data):

http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
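As a rough illustration of how "/Length" and "/Filter" are used, here is a minimal Python sketch that pulls the first stream out of a block of PDF bytes. The function name extract_first_stream and the sample object are made up for this example. It assumes "/Length" is written as a direct number (real files may use an indirect reference instead) and it only handles the FlateDecode filter, so treat it as a starting point rather than a parser.

```python
import re
import zlib


def extract_first_stream(pdf_bytes):
    """Return the decoded bytes of the first stream found in pdf_bytes."""
    # The stream dictionary (with /Length and /Filter) precedes the keyword.
    start = pdf_bytes.index(b"stream")
    dictionary = pdf_bytes[:start]
    length = int(re.search(rb"/Length\s+(\d+)", dictionary).group(1))

    # The data starts after the line ending that follows 'stream'.
    data_start = start + len(b"stream")
    if pdf_bytes[data_start:data_start + 2] == b"\r\n":
        data_start += 2
    elif pdf_bytes[data_start:data_start + 1] == b"\n":
        data_start += 1

    raw = pdf_bytes[data_start:data_start + length]
    # FlateDecode means the bytes are zlib-compressed.
    if b"/FlateDecode" in dictionary:
        return zlib.decompress(raw)
    return raw


# A hand-built stream object, only for demonstration.
content = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(content)
sample = (
    b"1 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)
decoded = extract_first_stream(sample)
print(decoded)
```

In AS3 the same idea would mean setting the ByteArray position to the start of the stream data and reading "/Length" bytes from there.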

Anyway, all of this makes more sense if you open your PDF in a hex editor, where you can see where your streams are positioned and tell AS3 to extract from there. Try the one below if you need one:

http://www.hhdsoftware.com/free-hex-editor
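If you would rather not install anything, a few lines of script give a similar view. This is a small Python sketch (hex_dump is a name invented here, and the sample bytes are made up to resemble the start of a PDF) that prints offsets, hex values and printable characters side by side, which is enough to confirm that the file really starts with the bytes of "%PDF-".

```python
def hex_dump(data, width=16):
    """Return hex-editor style lines: offset, hex bytes, printable text."""
    lines = []
    pad = width * 3
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots, as most hex editors do.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return lines


# Made-up sample; for a real file, read its bytes with open(path, "rb").
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
for line in hex_dump(sample):
    print(line)
```

The first five hex values should be 25 50 44 46 2d, one byte per character of the header.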

If there's still a problem, upload your PDF somewhere and say exactly what you're trying to extract from the document. I will try to give exact help for that (no promises, just trying to help). Peace.