Python and support for EUC-KR

I am experimenting with string encodings in Python 3.10, in particular to demonstrate the yen/won/backslash encoding issue.

So the following behavior (irreversible mapping) makes sense to me:

>>> "¥".encode("shift-jis").decode("shift-jis")
'\\'

I can also verify this with my copy of iconv:

$ echo -n "¥" | iconv -f utf-8 -t shift-jis | hexdump
0000000 005c
0000001
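
A minimal sketch of why this mapping is irreversible, assuming Python's shift-jis codec behaves as in the session above: the yen sign and the ASCII backslash both encode to the single byte 0x5C, so decoding cannot tell them apart.

```python
# Both characters collapse onto byte 0x5C under Python's shift-jis codec.
yen_bytes = "¥".encode("shift-jis")
backslash_bytes = "\\".encode("shift-jis")

print(yen_bytes.hex())        # 5c
print(backslash_bytes.hex())  # 5c

# Decoding always yields the backslash, so the yen sign is lost.
round_trip = yen_bytes.decode("shift-jis")
print(round_trip == "\\")     # True
```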

Now I struggle to understand the following behavior:

>>> "₩".encode("euc-kr")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'euc_kr' codec can't encode character '\u20a9' in position 0: illegal multibyte sequence

While:

$ echo -n "₩" | hexdump
0000000 82e2 00a9
0000003
$ echo -n "₩" | iconv -f utf-8 -t euc-kr | hexdump
0000000 dca3
0000002
$ echo -n "₩" | iconv -f utf-8 -t euc-kr | iconv -f euc-kr -t utf-8 | hexdump
0000000 bfef 00a6
0000003
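
As a reading aid: plain hexdump prints 16-bit words, so on a little-endian machine each byte pair appears swapped. The sketch below assumes that byte order, reassembles the bytes from the three dumps above, and names the resulting code points.

```python
import unicodedata

# Bytes in stream order (the hexdump words above, un-swapped).
utf8_input = bytes.fromhex("e282a9")       # from "82e2 00a9"
euc_kr_output = bytes.fromhex("a3dc")      # from "dca3"
utf8_round_trip = bytes.fromhex("efbfa6")  # from "bfef 00a6"

original = utf8_input.decode("utf-8")
returned = utf8_round_trip.decode("utf-8")

print(f"U+{ord(original):04X}", unicodedata.name(original))  # U+20A9 WON SIGN
print(f"U+{ord(returned):04X}", unicodedata.name(returned))  # U+FFE6 FULLWIDTH WON SIGN

# Python decodes the EUC-KR bytes produced by iconv to the fullwidth sign too.
print(euc_kr_output.decode("euc-kr") == returned)            # True
```

Running `hexdump -C` instead prints the bytes in stream order and avoids the swap.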

My naive understanding of KS X 1001 (registered as ISO-IR 149) was that ₩ really is \ (byte 0x5C):

Encoding schemes of KS X 1001 include EUC-KR (in both ASCII and ISO 646-KR based variants, the latter of which includes a won currency sign (₩) at byte 0x5C rather than a backslash)

What did I misunderstand about KS X 1001?

  1. Why isn't Python returning the \ symbol?
  2. Why is iconv returning code dca3 (U+FFE6 FULLWIDTH WON SIGN) for U+20A9 WON SIGN?

For reference:

$ python3 --version
Python 3.10.12

and

$ iconv --version
iconv (Ubuntu GLIBC 2.35-0ubuntu3.6) 2.35

Answer by Mark Ransom:

The problem is that Unicode has two won symbols: U+20A9 WON SIGN and U+FFE6 FULLWIDTH WON SIGN. Python's euc-kr codec maps only the fullwidth version, but you're testing the other one. They may have done this precisely to avoid the problem you were testing for. This works fine:

>>> "\uffe6".encode("euc-kr")
b'\xa3\xdc'
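
If the goal is to reproduce what iconv did in the question, one possible workaround is to substitute U+FFE6 for U+20A9 before encoding. This is only a sketch, not something Python's codec does for you, and the helper name `encode_euc_kr_won` is made up for illustration.

```python
# Hypothetical helper: swap WON SIGN (U+20A9) for FULLWIDTH WON SIGN (U+FFE6),
# which Python's euc-kr codec accepts, and then encode.
WON_TO_FULLWIDTH = str.maketrans({"\u20a9": "\uffe6"})

def encode_euc_kr_won(text: str) -> bytes:
    return text.translate(WON_TO_FULLWIDTH).encode("euc-kr")

encoded = encode_euc_kr_won("\u20a9")
print(encoded)                                # b'\xa3\xdc'

# The substitution is lossy: decoding returns the fullwidth sign, not U+20A9.
print(encoded.decode("euc-kr") == "\uffe6")   # True
```

If the original character matters after a round trip, NFKC normalization (`unicodedata.normalize("NFKC", ...)`) maps U+FFE6 back to U+20A9, though it also rewrites other compatibility characters.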