How do you determine the byte width of a UTF-16 character?

429 views Asked by At

What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain where the standards don't hold true (or there's an aspect of the standard that I'm missing).

From the reading the UTF-16 standard https://www.rfc-editor.org/rfc/rfc2781:

Value of leading 2 bytes Resulting character length (bytes)
0x0000-0xC7FF 2
0xD800-0xDBFF 4
0xDC00-0xDFFF Invalid sequence (RFC2781 2.2.2)
0xDFFF-0xFFFF 4

In practice, this appears to hold true, for some cases at least. Using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), but also verified with an online decoder:

Character Unicode Name ISO 10646 UTF-16 Encoding (hexadecimal, big endian) Size (bytes)
A LATIN CAPITAL LETTER A U+0041 00 41 2
Б CYRILLIC CAPITAL LETTER BE U+0411 04 11 2
KATAKANA LETTER SMALL A U+30A1 30 A1 2
RABBIT FACE U+1F430 D8 3D DC 30 4

However when encoding the following ISO 10646 character into UTF-16, it appears to be 4 bytes, but reading the leading 2 bytes appears to give no indication that it will be this long:

Character Unicode Name UTF-16 Encoding (hexadecimal, big endian) Size (bytes)
⚕️ STAFF OF AESCULAPIUS 26 95 FE 0F 4

Whilst I'd rather keep my question software-agnostic; the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with default collation and default language. (Note that SQL Server is little endian).

select cast(N'⚕️' as varbinary);
----------
0x95260FFE

Quite simply, how/why do you read 0x2695 and think "I'll need to read in the next word for this character."? Why doesn't this appear to align with the published UTF-16 standard?

2

There are 2 answers

1
Rob Napier On BEST ANSWER

The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.

Code points are what "U+...." syntax is referring to, and is attempting to capture a "unit" of written language, for example "an acute accent." But what a reader would think of a character (for example "an e with an acute accent") is a "grapheme cluster" and is made up of one or more code points. What is ultimately rendered to the screen is a "glyph" which is both context- and font-dependent.

Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)

The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make thing even more complicated, Unicode also defines a single code point for ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")

Your specific case is a bit more special. The Variation Selector predates Unicode, and is mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it eventually has come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.

Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.

Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).

There are other combining systems as well:

  • U+0031: 1
  • U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
  • U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)

All of these predate Unicode. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).

But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."

2
Rab On

All of your assertions are completely correct; your interpretation of the UTF-16 standards is correct and complete.

In your empirical observations however, you've assumed that you only have one character. In actuality, you've ran into a nuance of the Unicode implementation. Your "character" is actually two (albeit technically, not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.

This results in the byte sequence 26 95 FE 0F, however as you note neither of the words fall within the UTF-16 reserved extension character range. But this is because neither of them require the UTF-16 4 byte extension. They're simply classified as two discrete Unicode characters.

From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):,

Combining marks are a special class of characters in the Unicode Standard that are intended to combine with a preceding character, called their base.

Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.

http://unicode.org/L2/L2010/10038-fcd10646-main.pdf


To explain why I'm answering my own question; I had my SO question all ready to fire off. My wife came into my office; after looking over my shoulder she whispered into my ear, "You know combination characters are a thing, right?". I've however still asked the question and answered it myself, in case my wife's sweet nothings help another member of the community.