I am having trouble with non ascii characters being returned. I am not sure at which level the issue resides. It could be the actual PDF encoding, the decoding used by CAM::PDF (which is FlateDecode) or CAM::PDF itself. The following returns a string full of the commands used to create the PDF (Tm, Tj, etc).
use CAM::PDF;
my $filename = "sample.pdf";
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;
You can download sample.pdf here
The text returned in the Tj often has one character which is non ASCII. In the PDF, the actual character is almost always a quote, single or double.
While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.
:edit: Regarding Windows-1252. My PDF returns an "Õ" instead of apostrophes. The Õ character is hex 0xD5 in Windows-1252 and UTF-8. If the idea is that the character is encoded with Windows-1252, then it should be a hex 0x91 or 0x92 which it is not. Which is why the following does nothing to the character:
use Encode qw(decode encode);
my $page_string = 'Õ';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
open STS, ">TEST.txt";
print STS $octets . "\n";
I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":
If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.
EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.