CAM::PDF returning non ascii character instead of quotes

1.7k views Asked by At

I am having trouble with non ascii characters being returned. I am not sure at which level the issue resides. It could be the actual PDF encoding, the decoding used by CAM::PDF (which is FlateDecode) or CAM::PDF itself. The following returns a string full of the commands used to create the PDF (Tm, Tj, etc).

use CAM::PDF;

my $filename = "sample.pdf"; 
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;

You can download sample.pdf here

The text returned in the Tj often has one character which is non ASCII. In the PDF, the actual character is almost always a quote, single or double.

While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.

:edit: Regarding Windows-1252. My PDF returns an "Õ" instead of apostrophes. The Õ character is hex 0xD5 in Windows-1252 and UTF-8. If the idea is that the character is encoded with Windows-1252, then it should be a hex 0x91 or 0x92 which it is not. Which is why the following does nothing to the character:

use Encode qw(decode encode);
my $page_string = 'Õ';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
open STS, ">TEST.txt";
print STS $octets . "\n";
3

There are 3 answers

6
Chris Dolan On BEST ANSWER

I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":

"Within a literal string, the backslash (\) is used as an escape character for various purposes, such as to include newline characters, nonprinting ASCII characters, unbalanced parentheses, or the backslash character itself in the string. [...] The \ddd escape sequence provides a way to represent characters outside the printable ASCII character set."

If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.

EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.

1
a_note On

"receiving" - where? You get "Õ" instead of expected what? And doing exactly what? You know that windows command prompt uses dos code page, not windows-1252, right? (oops, new thread again... probably i should register here :-) )

8
a_note On

Sorry to intrude, and with all due respect, sir, but file IS compliant. Section 3.2.3 further states:

[The \ddd] notation provides a way to specify characters outside the 7-bit ASCII character set by using ASCII characters only. However, any 8-bit value may appear in a string.