I have a reasonable size flat file database of text documents mostly saved in 8859 format which have been collected through a web form (using Perl scripts). Up until recently I was negotiating the common 1252 characters (curly quotes, apostrophes etc.) with a simple set of regex's:
$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right
... etc.
However since I decided I ought to be going Unicode, and have converted all my scripts to read in and output utf8 (which works a treat for all new material), the regex for these (existing) 1252 characters no longer works and my Perl html output outputs literally the 4 characters: '\x92' and '\x93' etc. (at least that's how it appears on a browser in utf8 mode, downloading (ftp not http) and opening in a text editor (textpad) it's different, a single undefined character remains, and opening the output file in Firefox default (no content type header) 8859 mode renders the correct character).
The new utf8 pragmas at the start of the script are:
use CGI qw(-utf8); use open IO => ':utf8';
I understand this is due to utf8 mode making the characters double byte instead of single byte and applies to those chars in the 0x80 to 0xff range, having read up the article on wikibooks relating to this, however I was non the wiser as to how to filter them. Ideally I know I ought to resave all the documents in utf8 mode (since the flat file database now contains a mixture of 8859 and utf8), however I will need some kind of filter in the first place if I'm going to do this anyway.
And I could be wrong as to the 2-byte storage internally, since it did seem to imply that Perl handles stuff very differently according to various circumstances.
If anybody could provide me with a regex solution I would be very grateful. Or some other method. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There's simply about 6 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin lot in utf8 and forget there ever was a 1252...
Ikegami already mentioned the Encoding::FixLatin module.
Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:
Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.