Python 2.7: Strange Unicode behavior


I am experiencing the following behavior in Python 2.7:

>>> a1 = u'\U0001f04f'  #1
>>> a2 = u'\ud83c\udc4f'  #2
>>> a1 == a2  #3
False
>>> a1.encode('utf8') == a2.encode('utf8')  #4
True
>>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8')  #5
True
>>> u'\ud83c\udc4f'.encode('utf8') #6
'\xf0\x9f\x81\x8f'
>>> u'\ud83c'.encode('utf8')  #7
'\xed\xa0\xbc'
>>> u'\udc4f'.encode('utf8')  #8
'\xed\xb1\x8f'
>>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be')  #9
u'\U0001f04f'

What is the explanation for this behavior? More specifically:

  1. I'd expect two strings to be equal if statement #5 is true, while #3 proves otherwise.
  2. Encoding both code points together like in statement #6 yields results different from when encoded one by one in #7 and #8. Looks like the two code points are treated as one 4-byte code point. But what if I actually want them to be treated as two different code points?
  3. As #9 shows, the numbers in a2 are actually a1 encoded as UTF-16-BE. Yet even though they were specified as Unicode code points using \u inside a Unicode string (!), Python still arrived at equality in #5. How is that possible?

Nothing makes sense here! What's going on?

Answer by Martijn Pieters (best answer)

Python 2 violates the Unicode standard here by permitting you to encode code points in the range U+D800 to U+DFFF, at least in a UCS4 build. From Wikipedia:

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

The official UTF-8 standard has no encoding for UTF-16 surrogate pair codepoints, so Python 3 raises an exception when you try:

>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
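
If you really do want the two surrogates treated as separate code points, Python 3 lets you opt in explicitly. The following is a minimal sketch, assuming Python 3 and its standard `surrogatepass` error handler; the variable names are only illustrative. It shows that each surrogate is then encoded on its own, giving the bytes from #7 and #8 in the question rather than the combined 4-byte form from #6:

```python
# Python 3 sketch: surrogates under the strict codec vs. 'surrogatepass'.
pair = '\ud83c\udc4f'  # two surrogate code points, not U+1F04F

# The strict UTF-8 codec refuses surrogate code points.
try:
    pair.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With 'surrogatepass', each surrogate is encoded separately (3 bytes each).
passed = pair.encode('utf-8', 'surrogatepass')

# The real code point U+1F04F encodes to the usual 4-byte sequence.
real = '\U0001f04f'.encode('utf-8')

print(strict_failed)   # True
print(passed)          # b'\xed\xa0\xbc\xed\xb1\x8f'
print(real)            # b'\xf0\x9f\x81\x8f'
print(passed == real)  # False
```

Decoding `passed` again needs the same handler, `passed.decode('utf-8', 'surrogatepass')`, because those bytes are not valid strict UTF-8.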

But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:

>>> import sys
>>> sys.maxunicode
65535
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True

This is because such a build stores every non-BMP code point as a UTF-16 surrogate pair (an extension of the UCS2 standard).
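
The surrogate pair arithmetic itself is short. As a sketch (the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, not part of any library), this is how U+1F04F maps to the two values used in a2 and back again:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F04F)
print(hex(high), hex(low))                  # 0xd83c 0xdc4f
print(hex(from_surrogate_pair(high, low)))  # 0x1f04f
```

This is why the UTF-16-BE bytes in #9 of the question, d8 3c dc 4f, decode to U+1F04F.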

So on a UCS2 build there is no difference between your two values, and encoding the pair as the single non-BMP code point is entirely valid on the assumption that you meant U+1F04F and other such code points. The UTF-8 encoder in a UCS4 build just matches that behaviour, which is why #4 and #5 compare equal even though #3 does not.
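
If you end up with a string like a2 on Python 3, one common way to collapse surrogate pairs into the code points they stand for is a round trip through UTF-16. This is a sketch assuming Python 3; `merge_surrogates` is a hypothetical helper name:

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in text into single code points.

    'surrogatepass' lets the encoder write the surrogates out as raw
    UTF-16 code units; the strict decoder then reads each high/low pair
    back as one code point. An unpaired surrogate makes the decode step
    raise UnicodeDecodeError.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


a1 = '\U0001f04f'
a2 = '\ud83c\udc4f'

print(a1 == a2)                    # False
print(merge_surrogates(a2) == a1)  # True
print(len(a2), len(merge_surrogates(a2)))  # 2 1
```

Text without surrogates passes through unchanged, so the helper can be applied to input of unknown origin as long as an exception on lone surrogates is acceptable.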