I am experiencing the following behavior in Python 2.7:
>>> a1 = u'\U0001f04f' #1
>>> a2 = u'\ud83c\udc4f' #2
>>> a1 == a2 #3
False
>>> a1.encode('utf8') == a2.encode('utf8') #4
True
>>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8') #5
True
>>> u'\ud83c\udc4f'.encode('utf8') #6
'\xf0\x9f\x81\x8f'
>>> u'\ud83c'.encode('utf8') #7
'\xed\xa0\xbc'
>>> u'\udc4f'.encode('utf8') #8
'\xed\xb1\x8f'
>>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be') #9
u'\U0001f04f'
What is the explanation for this behavior? More specifically:
- I'd expect the two strings to be equal if statement #5 is true, yet #3 shows they are not.
- Encoding both code points together, as in statement #6, yields a different result from encoding them one by one in #7 and #8. It looks like the two code points are treated as a single code point that encodes to four bytes. But what if I actually want them to be treated as two separate code points?
- As you can see from #9, the values in a2 are actually a1 encoded using UTF-16-BE, yet even though they were specified as Unicode code points using \u escapes inside a Unicode string (!), Python still somehow arrives at equality in #5. How is that possible?
Nothing makes sense here! What's going on?
Python 2 is violating the Unicode standard here by permitting you to encode codepoints in the range U+D800 to U+DFFF, at least on a UCS4 build. As the Wikipedia article on UTF-8 notes, the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values on their own, and their UTF-8 encoding must be treated as an invalid byte sequence.
The official UTF-8 standard has no encoding for UTF-16 surrogate pair codepoints, so Python 3 raises an exception when you try:
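For example, in a Python 3 interactive session (a sketch; the exact wording of the error message can differ between Python 3 versions):
>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed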
But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:
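A sketch of what a UCS2 (narrow) Python 2 session would show, with a len() call added to make the surrogate-pair storage visible:
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True
>>> len(a1), len(a2)
(2, 2)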
because in such a build all non-BMP codepoints are stored as UTF-16 surrogate pairs (an extension beyond the plain UCS2 standard).
So on a UCS2 build there is no difference between your two values, and encoding the pair as the full non-BMP codepoint is entirely reasonable on the assumption that you meant to encode U+1F04F and other such codepoints. The UCS4 build simply matches that behaviour.
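If you are not sure which build variant you are running, sys.maxunicode tells you (a quick check; the values shown are the standard ones for Python 2 narrow and wide builds):
>>> import sys
>>> sys.maxunicode  # 65535 (0xFFFF) on a UCS2 build, 1114111 (0x10FFFF) on a UCS4 build
1114111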