The proper way of detecting encoding in Perl


I've got these two strings:

%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83

These are URL-encoded phrases in Russian, in cp1251 and UTF-8 respectively. I want to print them in Russian on my UTF-8 terminal using Perl. Unfortunately, the Perl module Encode::Detect (after URL-decoding) fails to detect cp1251 for the first example. Instead, it proposes "x-euc-tw".
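
For reference, a minimal sketch that reproduces the misdetection (assuming URI::Escape is installed; the detect() call is from Encode::Detect::Detector, which ships with Encode::Detect):

use strict;
use warnings;
use URI::Escape qw( uri_unescape );
use Encode::Detect::Detector;

my $query = '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3';
(my $bytes = $query) =~ tr/+/ /;        # '+' encodes a space in query strings
$bytes = uri_unescape($bytes);          # now raw cp1251 bytes

# Ask the Mozilla-based detector what it thinks the bytes are
my $charset = Encode::Detect::Detector::detect($bytes);
print $charset // 'unknown', "\n";      # prints "x-euc-tw" rather than windows-1251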

The question is, what is the proper way of detecting the right encoding in this case (specifying locale parameters, using other modules...)?

ikegami (accepted answer)

Are UTF-8 and cp1251 the only two options? The odds of cp1251 text also being valid UTF-8 are extremely small. (It would be gibberish.) So you can do

use Encode qw( decode );

# Try strict UTF-8 first; decode() croaks on malformed bytes, eval returns
# undef, and we fall back to cp1251. LEAVE_SRC keeps $encoded intact so the
# fallback sees the original bytes.
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK | Encode::LEAVE_SRC) }
    // decode('cp1251', $encoded);

This will be far more accurate than an encoding guesser.
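
For completeness, a self-contained sketch applying this fallback to both URL-encoded strings from the question (the variable names are illustrative):

use strict;
use warnings;
use Encode qw( decode );
use URI::Escape qw( uri_unescape );

binmode STDOUT, ':encoding(UTF-8)';    # the terminal expects UTF-8

for my $query ('%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
               '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83') {
    (my $bytes = $query) =~ tr/+/ /;   # '+' encodes a space
    $bytes = uri_unescape($bytes);

    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) }
        // decode('cp1251', $bytes);
    print "$text\n";                   # both lines print: мама мыла раму
}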

Joni

Encode::Detect, which uses the Mozilla universal character set detector, works by letting different character set probers look at the data. The probers then report different confidence levels and the prober with highest confidence wins. This process depends on the input only; it is not affected by locale or other external settings. In this case, for whatever reason, the prober for euc-tw is reporting a higher confidence than the prober for windows-1251, and there's nothing you can do short of changing the data or modifying the source code.

You could try using Encode::Guess, which lets you specify a list of encodings to choose from.
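
A sketch of that approach, with one caveat: since every byte string is valid cp1251, Encode::Guess reports an ambiguity when the input is also valid UTF-8 (as with the second string), so it works best when the suspects rarely overlap:

use strict;
use warnings;
use Encode::Guess qw( cp1251 );    # add cp1251 to the default suspects (ascii, utf8)

# URL-decoded bytes of the first string, hard-coded here for illustration
my $bytes = "\xEC\xE0\xEC\xE0 \xEC\xFB\xEB\xE0 \xF0\xE0\xEC\xF3";

my $decoder = Encode::Guess->guess($bytes);
die "Can't guess: $decoder\n" unless ref $decoder;  # guess() returns an error string on failure

binmode STDOUT, ':encoding(UTF-8)';
print $decoder->decode($bytes), "\n";               # prints: мама мыла раму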