Decoding a url-encoded windows-1251 (cp1251) string with JavaScript


I have run into a problem and have not found a proper solution: I need to decode a URL fragment that is percent-encoded using windows-1251 (cp1251).

I know about the decodeURI() and decodeURIComponent() methods, but they work for UTF-8 only (as far as I understand). The only solution I have found uses the deprecated functions escape() and unescape().

For example, take this sequence:

%EF%F0%EE%E3%F0%E0%EC%EC%E8%F0%EE%E2%E0%ED%E8%E5 (программирование)

Calling decodeURI() or decodeURIComponent() on it throws an exception (a URIError).
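The failure can be reproduced with a minimal sketch like the one below. The helper name tryDecodeUtf8 is made up for illustration. The escapes decode to bytes that are not valid UTF-8 (0xEF starts a three-byte sequence, but 0xF0 is not a continuation byte), so decodeURIComponent() throws.

```javascript
// The sample from the question: "программирование" in windows-1251, percent-encoded.
const input = "%EF%F0%EE%E3%F0%E0%EC%EC%E8%F0%EE%E2%E0%ED%E8%E5";

// Hypothetical helper: wraps decodeURIComponent so the failure can be inspected
// instead of propagating as an exception.
function tryDecodeUtf8(encoded) {
  try {
    return { ok: true, value: decodeURIComponent(encoded) };
  } catch (err) {
    return { ok: false, errorName: err.name };
  }
}

const result = tryDecodeUtf8(input);
console.log(result); // { ok: false, errorName: 'URIError' }
```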

I would be grateful for any help.

2

There are 2 answers

5
Nickolay On BEST ANSWER

There's no built-in support for the percent-encoding scheme with legacy charsets in the browser, as far as I can see. You'll have to:

  1. find the %-escapes representing the win-1251 octets,
  2. decode the win-1251 octets to the corresponding characters (JS String)

Below is one way to do it. For step 1 I assume that only three-character upper-case escapes (%XY) need decoding and that the rest of the string is already ASCII, so I just use inputStr.replace(/%([0-9A-F]{2})/g, replacerFunction) for this.

For the actual decoding (step 2) you need a mapping from the win-1251 octets to JS characters. In the example below I build the mapping using the TextDecoder.decode() API, just for fun (and in case someone finds this answer while trying to convert between different charsets in JS). (Note: it isn't universally supported as of this writing; only Gecko and Blink support it.)

There's also https://github.com/mathiasbynens/windows-1251 , which I initially wanted to use for this answer, but it turned out to be easier to just build the decoding map by hand.

var decodeMap = {};
var win1251 = new TextDecoder("windows-1251");
for (var i = 0x00; i <= 0xFF; i++) {
  var hex = (i <= 0x0F ? "0" : "") +      // zero-padded
            i.toString(16).toUpperCase();
  decodeMap[hex] = win1251.decode(Uint8Array.from([i]));
}
// console.log(decodeMap);
// {"10":"\u0010", ... "40":"@","41":"A","42":"B", ... "C0":"А","C1":"Б", ...


// Decodes a windows-1251 encoded string, additionally
// encoded as an ASCII string where each non-ASCII character of the original
// windows-1251 string is encoded as %XY where XY (uppercase!) is a
// hexadecimal representation of that character's code in windows-1251.
function percentEncodedWin1251ToDOMString(str) {
  return str.replace(/%([0-9A-F]{2})/g,
    (match, hex) => decodeMap[hex]);
}

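// "%!" is not a valid escape, so it is left untouched; the trailing "a" is plain ASCII.
console.log(percentEncodedWin1251ToDOMString("%EF%F0%EE%E3%F0%E0%EC%EC%!%E8%F0%EE%E2%E0%ED%E8%FFa"));
// -> "программ%!ированияa"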
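Where TextDecoder is not available, the Cyrillic letters can also be mapped arithmetically. The following is only a sketch under stated assumptions, not a full windows-1251 decoder: it relies on bytes 0xC0-0xFF corresponding to U+0410-U+044F (А through я), handles Ё (0xA8) and ё (0xB8) explicitly, and leaves every other byte in the 0x80-0xBF range as an undecoded escape. The function name decodeWin1251Letters is made up for illustration, and unlike the code above it also accepts lower-case hex digits.

```javascript
// Sketch: decode only ASCII and the Cyrillic letters of windows-1251.
// Assumption: 0xC0-0xFF -> U+0410-U+044F, 0xA8 -> Ё, 0xB8 -> ё.
// Any other byte in 0x80-0xBF is not covered and stays as its %XY escape.
function decodeWin1251Letters(str) {
  return str.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const byte = parseInt(hex, 16);
    if (byte < 0x80) return String.fromCharCode(byte);          // ASCII range
    if (byte >= 0xC0) return String.fromCharCode(byte - 0xC0 + 0x0410); // А..я
    if (byte === 0xA8) return "\u0401"; // Ё
    if (byte === 0xB8) return "\u0451"; // ё
    return match; // not covered by this sketch
  });
}

const decodedLetters = decodeWin1251Letters(
  "%EF%F0%EE%E3%F0%E0%EC%EC%E8%F0%EE%E2%E0%ED%E8%E5");
console.log(decodedLetters); // программирование
```

For anything beyond plain Cyrillic text (quotation marks, the numero sign, Ukrainian or Belarusian letters), a full table such as the one built with TextDecoder above is the safer choice.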

0
Дмитрий Кондратьев On
  1. find strings: "%EF%F0%EE%E3%F0%E0%EC%EC", "%E8%F0%EE%E2%E0%ED%E8%E5"
  2. replace "%" with ",0x": ",0xEF,0xF0,0xEE,0xE3,0xF0,0xE0,0xEC,0xEC", ",0xE8,0xF0,0xEE,0xE2,0xE0,0xED,0xE8,0xE5"
  3. slice off the leading comma: "0xEF,0xF0,0xEE,0xE3,0xF0,0xE0,0xEC,0xEC", "0xE8,0xF0,0xEE,0xE2,0xE0,0xED,0xE8,0xE5"
  4. split into an array of strings: ["0xEF","0xF0","0xEE","0xE3","0xF0","0xE0","0xEC","0xEC"], ["0xE8","0xF0","0xEE","0xE2","0xE0","0xED","0xE8","0xE5"]
  5. create an array of bytes: [239,240,238,227,240,224,236,236], [232,240,238,226,224,237,232,229]
  6. decode the bytes as win-1251 strings: "программ", "ирование", and substitute them for the matched escapes
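var win1251 = new TextDecoder("windows-1251"),
    s = "%EF%F0%EE%E3%F0%E0%EC%EC%!%E8%F0%EE%E2%E0%ED%E8%E5a";
// Decode each run of consecutive %XY escapes in a single call.
s = s.replace(/(?:%[0-9A-F]{2})+/g,
  run => win1251.decode(new Uint8Array(
    // "%EF%F0" -> ",0xEF,0xF0" -> "0xEF,0xF0" -> ["0xEF", "0xF0"];
    // the typed array constructor converts each "0x.." string to a number
    run.replace(/%/g, ",0x").slice(1).split(",")
  )));
alert(s);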
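The same steps can be wrapped in a reusable function. This is a sketch, not part of the original answer: the name decodeWin1251Runs is made up for illustration, and it parses each hex pair with parseInt instead of relying on string-to-number conversion inside the Uint8Array constructor. It assumes a runtime whose TextDecoder accepts the "windows-1251" label (in Node.js that depends on the ICU data the binary was built with).

```javascript
// Assumption: TextDecoder in this runtime supports the "windows-1251" label.
const win1251Decoder = new TextDecoder("windows-1251");

// Hypothetical helper: decodes every run of consecutive upper-case %XY escapes
// as windows-1251 and leaves everything else (including malformed escapes) as-is.
function decodeWin1251Runs(str) {
  return str.replace(/(?:%[0-9A-F]{2})+/g, run => {
    // "%EF%F0" -> "EF%F0" -> ["EF", "F0"] -> bytes [0xEF, 0xF0]
    const bytes = Uint8Array.from(run.slice(1).split("%"), h => parseInt(h, 16));
    return win1251Decoder.decode(bytes);
  });
}

console.log(decodeWin1251Runs("%EF%F0%EE%E3%F0%E0%EC%EC%!%E8%F0%EE%E2%E0%ED%E8%E5a"));
// -> "программ%!ированиеa"
```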