Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?


I know that this is probably a stupid question, but I need to be sure about this issue. So what I need to know is: if a programming language says that its String type uses the UTF-16 encoding, does that mean that:

  1. it will use 2 bytes for code points in the range of U+0000 to U+FFFF.
  2. it will use surrogate pairs for code points larger than U+FFFF (4 bytes per code point).

Or do some programming languages use their own "tricks" when encoding and not follow this standard 100%?
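To make the question concrete, this is the behaviour I have in mind, sketched in Python (my own illustration, assuming a standard CPython 3 interpreter; the byte counts are exactly what I'd like confirmed):

    # A BMP code point should take one 16-bit code unit (2 bytes),
    # a code point above U+FFFF should take a surrogate pair (4 bytes).
    bmp = "\u20AC"         # U+20AC EURO SIGN, inside the BMP
    astral = "\U0001F600"  # U+1F600 GRINNING FACE, above U+FFFF

    print(len(bmp.encode("utf-16-be")))     # 2 -> one code unit
    print(len(astral.encode("utf-16-be")))  # 4 -> surrogate pair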


There are 2 answers

Kerrek SB (best answer)

UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.

I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit consists of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
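(To see those two layers separately, here is a small Python sketch of my own, not something from the answer itself: it splits a code point into surrogate code units first, and only then serializes the units to bytes, which is where endianness appears.)

    # First layer: split a code point above 0xFFFF into two 16-bit code units.
    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)   # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
    print(hex(high), hex(low))  # 0xd83d 0xde00

    # Second layer: serialize the code units to 8-bit bytes; only here does
    # byte order (big- vs little-endian) come into play.
    s = chr(cp)
    print(s.encode("utf-16-be").hex())  # d83dde00
    print(s.encode("utf-16-le").hex())  # 3dd800de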

But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 2²¹), and it's up to you how to represent and serialize those.

I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have made if we had to redo everything now: it is a variable-length encoding just like UTF-8, so you gain no random access, as opposed to UTF-32, but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
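(A small Python comparison of my own, not from the original answer, illustrates the verbosity and variable-length points; byte counts exclude any BOM.)

    # Byte counts for the same text under the three encodings discussed above.
    for text in ("hello", "h\u00E9llo", "h\U0001F600llo"):
        print(repr(text),
              len(text.encode("utf-8")),
              len(text.encode("utf-16-be")),
              len(text.encode("utf-32-be")))
    # 'hello'   ->  5 / 10 / 20
    # 'héllo'   ->  6 / 10 / 20
    # 'h😀llo'  ->  8 / 12 / 20  (the emoji needs a surrogate pair in UTF-16)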

The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bits would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (much as UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring, say, UCS-2 for pure Asian text.)

bobince

UTF-16 per se is standard. However, most languages whose strings are based on 16-bit code units (whether or not they claim to ‘support’ UTF-16) can hold any sequence of code units, including invalid surrogates. For example, this is typically an acceptable string literal:

"x \uDC00 y \uD800 z"

and usually you only get an error when you attempt to write it to another encoding.
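For instance, in Python 3 (my own sketch, not part of the answer) the equivalent literal is accepted without complaint, and the failure only appears at encode time:

    # Lone surrogates are fine inside the str itself...
    s = "x \uDC00 y \uD800 z"
    print(len(s))  # 9 characters, no error so far

    # ...but serializing to an actual encoding rejects them.
    try:
        s.encode("utf-16")
    except UnicodeEncodeError as e:
        print("encode failed:", e)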

Python's optional surrogateescape encode/decode error handler uses such invalid surrogates to smuggle undecodable single bytes 0x80–0xFF into the standalone surrogate code units U+DC80–U+DCFF, resulting in strings like the literal above. This is typically only used internally, so you're unlikely to meet it in files or on the wire; and it only applies to UTF-16 insofar as Python's str datatype is based on 16-bit code units (which is the case only on ‘narrow’ builds of Python 3.0 through 3.3).
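A rough sketch of that round-trip in standard Python 3 (the byte values here are my own example):

    # Bytes that are not valid UTF-8 are smuggled into U+DC80-U+DCFF...
    raw = b"abc\xff\xfe"
    text = raw.decode("utf-8", errors="surrogateescape")
    print([hex(ord(c)) for c in text])  # ['0x61', '0x62', '0x63', '0xdcff', '0xdcfe']

    # ...and encoding with the same handler restores the original bytes exactly.
    assert text.encode("utf-8", errors="surrogateescape") == raw

    # A strict encode, by contrast, rejects the lone surrogates.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as e:
        print("strict encode failed:", e)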

I'm not aware of any other commonly-used extensions/variants of UTF-16.