The proper way of encoding detection in perl

Asked by Igor Shalyminov on 27 July 2012 at 16:04

I've got these two strings:

%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83

These are the same Russian phrase, URL-encoded, in cp1251 and UTF-8 respectively. I want to see both in Russian in my UTF-8 terminal using Perl. Unfortunately, after URL-decoding, the Perl module Encode::Detect can't detect cp1251 for the first string. Instead, it reports "x-euc-tw".

The question is: what is the proper way to detect the right encoding in this case (specifying locale parameters, using other modules, ...)?

There are 2 answers

Joni On 27 July 2012 at 17:29

Encode::Detect, which uses the Mozilla universal character set detector, works by letting different character set probers look at the data. The probers then report different confidence levels and the prober with highest confidence wins. This process depends on the input only; it is not affected by locale or other external settings. In this case, for whatever reason, the prober for euc-tw is reporting a higher confidence than the prober for windows-1251, and there's nothing you can do short of changing the data or modifying the source code.

You could try using Encode::Guess, which lets you specify a list of candidate encodings to choose from.
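A minimal sketch of that suggestion, assuming the strings have already been URL-decoded into raw bytes. The helper `guess_name` is hypothetical and only wraps `guess_encoding` from Encode::Guess. One caveat worth seeing in practice: almost every byte sequence is valid cp1251, so UTF-8 input is likely to match both candidates, in which case `guess_encoding` returns an error string rather than an encoding object.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# The phrase from the question as raw bytes (already URL-decoded).
my $cp1251_bytes = "\xEC\xE0\xEC\xE0 \xEC\xFB\xEB\xE0 \xF0\xE0\xEC\xF3";
my $utf8_bytes   = "\xD0\xBC\xD0\xB0\xD0\xBC\xD0\xB0 \xD0\xBC\xD1\x8B\xD0\xBB\xD0\xB0 "
                 . "\xD1\x80\xD0\xB0\xD0\xBC\xD1\x83";

# Hypothetical helper: returns the encoding name, or undef when the
# guess fails or is ambiguous (guess_encoding then returns a plain string).
sub guess_name {
    my ($bytes) = @_;
    # ascii and utf8 are suspects by default; cp1251 is added here.
    my $enc = guess_encoding($bytes, 'cp1251');
    return ref $enc ? $enc->name : undef;
}

for my $bytes ($cp1251_bytes, $utf8_bytes) {
    printf "%s\n", guess_name($bytes) // 'ambiguous or unknown';
}
```

Because of that ambiguity, Encode::Guess on its own may not be enough to tell the two strings apart; the guess is only usable when exactly one candidate decodes cleanly.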

**ikegami** · Accepted Answer · 2012-07-27T17:45:31+00:00

Are UTF-8 and cp1251 the only two options? The odds of cp1251 text also being valid UTF-8 are extremely small. (It would be gibberish.) So you can try a strict UTF-8 decode first and fall back to cp1251 if it fails:

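use Encode qw( decode );
# $encoded holds the raw bytes left after URL-decoding.
# Strict UTF-8 decode croaks on invalid input; eval then yields undef
# and the // operator (Perl 5.10+) falls back to cp1251.
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK) }
    // decode('cp1251', $encoded);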

This will be far, far more accurate than an encoding guesser.
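For completeness, here is a sketch of the whole round trip on the two strings from the question, using only core modules. The helpers `url_decode` and `decode_utf8_or_cp1251` are hypothetical names introduced for illustration, and the sketch assumes UTF-8 and cp1251 really are the only possible inputs. `Encode::LEAVE_SRC` is added so the strict decode does not modify its input.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: undo form-style URL encoding, yielding raw bytes.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                                # '+' stands for a space
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX -> single byte
    return $s;
}

# Hypothetical helper: strict UTF-8 first, cp1251 as the fallback.
sub decode_utf8_or_cp1251 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('cp1251', $bytes);
}

my @inputs = (
    '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
    '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83',
);

my @decoded = map { decode_utf8_or_cp1251(url_decode($_)) } @inputs;

# Encode the character strings as UTF-8 for the terminal.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @decoded;
```

Both inputs should come out as the same character string, since the cp1251 bytes fail the strict UTF-8 check and take the fallback path.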
