Isn't a 2-byte char datatype insufficient to deal with the concept of "characters" in a Unicode string?


Various programming languages use a 2-byte char datatype (not to be confused with C/C++'s char, which is just one byte) out of which strings are constructed. Various utility functions will try to find such a char in a string, like looking for an e in hello, or perform other operations that accept or return chars (split, indexOf, replace, count the occurrences of a character in a string, length, ...).

If you dig deeper you will find out about Unicode code points. And indeed, Java (and I assume other languages as well) lets you iterate those code points. But those seem to be represented by an int (4 bytes), not a char (2 bytes). Very rarely, if ever, will you see people use code points to iterate through a string. Since such a code point may span multiple chars (at most 2, right? hence the int?) it's not the fastest way to do string operations, but it does seem to be the correct way.
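For example, in Java (a quick sketch; U+1F642 is just an arbitrary character outside the 16-bit range, chosen for illustration):

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "hi\uD83D\uDE42";   // "hi" plus U+1F642, stored as a surrogate pair

        System.out.println(s.length());                        // 4: counts 2-byte chars
        System.out.println(s.codePointCount(0, s.length()));   // 3: counts code points

        // Iterating code points yields ints, not chars:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // U+0068, U+0069, U+1F642
    }
}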

Some programs/frameworks/operating systems(?) will also fail to handle multi-char characters correctly, for example deleting only the second char of one and leaving a "corrupted" character behind.

Shouldn't you always use the methods that operate on code points when dealing with strings? What am I missing? I'm afraid someone will have to explain to me why the world keeps using char when this seems obsolete. Is the size of a char sufficient after all? I know there are additional "helper" characters for "upgrading" other characters (turn an o into ö and so on). How are these handled by char and code point iteration? Isn't there a chance to horribly corrupt your string if you replace chars instead of "whole" code points?


There are 3 answers

Answer by Giacomo Catenazzi (10 votes)

Note: this is a Western-world point of view; in parallel there was the history and evolution of Asian languages and scripts, which I skip. In any case, most character sets eventually got converted to Unicode.

Historically we had ASCII. In reality there were also other character encodings, some not even distinguishing lower and upper case, but ASCII became the de facto standard (on Western computers, which use Latin scripts). ASCII was not sufficient, so there were some extensions: "code pages", where every character was still 8-bit, but you could select which character set to use (and so which language to support).

All commonly used modern operating systems were born in that epoch, so programs, filesystems, APIs, text files, etc. started with that convention.

But the Internet and exchanging files became more and more common, and a file produced in Stockholm was not completely readable in Germany or in the US. ISO standardized some code pages (e.g. Latin-1, etc.), which had ASCII plus some characters in common, while other parts differed according to the encoding. Windows used Latin-1 and filled the unallocated space (you will see it described as "ANSI"). Asian scripts also became important (computers got better, so we could afford many more characters for everyday use, not just for typesetting).

So Unicode (and ISO) started to develop a new standard: one set for every character, compatible with all of the most common charsets (so you could transform into Unicode and back without losing information: this really helped make the transition smooth). And this new charset would have 16-bit code points [WARNING: this is no longer true, but it was so in the first Unicode versions]. (For this reason we have many combining characters, the "Han unification" (merging Chinese, Japanese, and old Korean characters into one set), and the special case of how the newer Korean characters are encoded.)

New languages adopted that version, and so got 16-bit Unicode characters.

Some operating systems added new APIs with these 16-bit characters (Microsoft Windows did, together with long filenames on the filesystem, in a compatible way, so old computers could still read the files [just with short names and 8-bit characters]). In this manner you had compatibility with old programs, but new programs could (they were not forced to) use the new Unicode.

Old languages and Unix waited, struggling with how to keep compatibility AND get the new Unicode.

This seems to be your world (as in your question), so the early 1990s.

Guess what? 16 bits were not sufficient. So the new (now already old) Unicode added planes and surrogates. Surrogates are a trick that keeps the already-allocated 16-bit code points valid while allowing (via surrogate pairs) characters up to 0x10FFFF. This was also a difference from ISO, which allowed 31-bit code points.
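The arithmetic behind the trick is small enough to show in a few lines (a sketch in Java; U+1F600 is just an arbitrary example of a code point above U+FFFF):

public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x1F600;                  // a code point outside the 16-bit range
        int offset = cp - 0x10000;         // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) high, (int) low);
        // Prints: U+1F600 -> D83D DE00
        // Java exposes the same mapping as Character.highSurrogate / lowSurrogate.
    }
}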

Around the same time UTF-8 emerged, which is compatible with ASCII (and with the end-of-string \0 used by many libraries/operating systems) while still allowing all the new Unicode characters.
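A quick illustration of that ASCII compatibility (a sketch in Java; the sample characters are my own choice): ASCII characters keep their single-byte values, everything else takes 2-4 bytes:

import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        for (String s : new String[] {"A", "é", "€", "\uD83D\uDE00"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.print(s + " ->");
            for (byte b : bytes) System.out.printf(" %02X", b & 0xFF);
            System.out.println();
        }
        // A  -> 41            (unchanged from ASCII)
        // é  -> C3 A9
        // €  -> E2 82 AC
        // 😀 -> F0 9F 98 80
    }
}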

Some more modern languages started to implement UTF-32 (so using 32-bit Unicode), some old ones adapted (e.g. new APIs), and some just kept the surrogates, so changing "code point" into "code unit". Python is one of the exceptions: an old language which converted to full Unicode (and now, internally, it selects the optimal size: 8-bit, 16-bit or 32-bit), but the Python 3 conversion was very painful (and incompatible with old code, and after 10 years many libraries were still not ready), so I think other old languages will think twice before trying to "upgrade".

The problem with your "problem" is that to get a 16-bit (or 32-bit) character type you need a flag day: everybody should update, every program and every operating system, on the same day. So you would have to check all the old code and adapt it, or have two sets of libraries, and practically every operating system split in two: using the old characters, or using the new ones.

Personally, I think the Unix way is the best one, so using UTF-8: keep ASCII compatibility, and extend. Old programs can process Unicode characters transparently, even if they were built before the Unicode epoch (for printing, storage, transmission, etc.; obviously, to get the semantics of the characters they need to be Unicode-aware).

Because of code units (two 16-bit code units are sometimes required for one Unicode code point), combining characters (do not assume one glyph is described by just one code point), variation selectors, emoji variants/tags, etc., it does not make much sense to iterate over and modify single characters. And we should not forget that fonts may build a glyph from various "characters".

So it is too difficult to go to UTF-32 globally, for all languages, because of existing programs and infrastructure. Now that UTF-8 seems to be dominant, I think we should keep UTF-8: people will either use Unicode libraries or just handle byte sequences transparently (maybe just merging, templating, etc.), maybe with simple searches (fine for ASCII; otherwise the Unicode string must be normalized).

Answer by tripleee (7 votes)

Well, yes. There are roughly three separate cases here.

  1. Languages and platforms which only support 16-bit characters (UCS-2). These cannot support the full Unicode range (notably, recent emoji additions are outside the BMP) but can trivially use 16 bits for everything related to Unicode characters. (You can still mess up by losing track of where you are inside a string, though it should be easy to avoid such errors by always making sure you are at an even byte offset.)

  2. Languages and platforms which support UTF-16 (including surrogates). As you note, you have to know that a single code point can be more than 16 bits, and adjust accordingly. I'm sure there are many Java applications which actually foul up surrogates, if you care to test them (see the sketch after this list).

  3. Languages and platforms which map everything to some internal representation. Ideally there should not even be a way to address the underlying bytes directly unless you specifically need to go there. (Compare with how Python gives you str unless you specifically decode into bytes, or vice versa. It's still possible to mess up if you just copy/paste code from Stack Overflow without understanding what it does.)
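To make case 2 concrete, here is a minimal sketch (my own Java example, not from the original answer) of how naive char-indexed code splits a surrogate pair, and how the code-point APIs avoid it:

public class SurrogateFoulUp {
    public static void main(String[] args) {
        String s = "x\uD83D\uDE00y";        // 'x', U+1F600 (two chars), 'y'

        // Cutting at an arbitrary char index can land between the surrogates:
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1)));  // true: a dangling half

        // Walking by code points keeps the pair together:
        int end = s.offsetByCodePoints(0, 2);    // index just after the 2nd code point
        System.out.println(s.substring(0, end)); // "x" followed by the intact U+1F600
    }
}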

Your question sort of presupposes that char and int are exposed and well-defined, but many languages do not easily let you access underlying byte representations with the versatility / abandon of C.

As for the "why", basically Java predates UTF-16 (and was designed before UTF-8 was in wide use). It's always harder to retrofit a new model onto an existing language and its libraries than to get it right from the start.

When I am writing this, "right" basically means UTF-8. It's not entirely unproblematic either, though the kind of futzing you need with surrogates etc. isn't necessary or useful (or, if you look at it from the other direction, the normal case is now somewhat futzier, but generally for good reasons); the remaining problems are typically endemic to Unicode (normalization of code points, locale-specific collation, rendering support, etc.). Perhaps future generations will smirk at this, too. https://utf8everywhere.org/ contains a lot more on how UTF-8 at least shields us from many mistakes which are still common in the 16-bit world.

Answer by phuclv (5 votes)

In summary, the answer to the question

Isn't a 2-byte char datatype insufficient to deal with the concept of "characters" in a Unicode string?

is: Yes, it's insufficient to store an arbitrary Unicode character, but you don't need to worry, because you don't use, and shouldn't use, it to iterate.


For more details, read below.


Shouldn't you always use the methods that operate on code points when dealing with strings?

One should almost never do that, because, contrary to popular belief, characters in UTF-32 do not have a fixed length. UTF-32 is a fixed-length encoding only for a single code point, but a user-perceived character may be represented by multiple code points:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

Grapheme cluster boundaries are important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.

Unicode Text Segmentation - Grapheme Cluster Boundaries

So we should only use graphemes, a.k.a. user-perceived characters, instead, and must not break segments down into code points. For example, people usually iterate strings to find a specific character, and if we want to find an ear of rice 🌾 U+1F33E then it'll unexpectedly match the farmer emoji 👨‍🌾, because that emoji is encoded as U+1F468 U+200D U+1F33E. That index may then be used to take a substring or do something else, which may surprise the user a lot. See Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?
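In Java, for instance, the accidental match looks like this (my own sketch; indexOf works on UTF-16 chars and knows nothing about grapheme clusters):

public class ZwjMatch {
    public static void main(String[] args) {
        String farmer = "\uD83D\uDC68\u200D\uD83C\uDF3E"; // 👨‍🌾 = U+1F468 U+200D U+1F33E
        String rice   = "\uD83C\uDF3E";                   // 🌾 = U+1F33E

        System.out.println(farmer.indexOf(rice)); // 3: a "match" in the middle of a
                                                  // single user-perceived character
    }
}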

Another common mistake is that people truncate a string to the first or last N characters and append/prepend "..." to fit it into the UI when it's too long, and then it breaks miserably, because the char at index N may be in the middle of a grapheme cluster. For example "‍‍‍‍️‍❤️‍‍" is a not-so-long string of 3 user-perceived characters, but it's made from 21 code points, so if you truncate at the 20th character you'll mess up the output string completely. Or check the Indic string "ফোল্ডার", which can easily be seen as having 4 characters by selecting it or moving through it with the mouse or arrow keys (although I must admit I'm not an expert in any Indic language), yet it's encoded as 7 code points (U+09AB U+09CB U+09B2 U+09CD U+09A1 U+09BE U+09B0) and will behave badly when truncated in the middle. Reversing a string or finding palindromes will suffer the same fate if you don't account for multi-code-point characters.
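A hedged sketch of a safer truncation (Java; java.text.BreakIterator's character instance only approximates user-perceived characters, and ICU4J's BreakIterator covers more of the emoji rules, but the shape of the code is the same):

import java.text.BreakIterator;

public class SafeTruncate {
    // Cut at a grapheme boundary instead of at a raw char index.
    static String truncate(String s, int maxGraphemes) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int end = 0;
        for (int i = 0; i < maxGraphemes; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) return s;  // fewer graphemes than the limit
            end = next;
        }
        // Only truncate if something actually follows the cut-off point.
        return it.next() == BreakIterator.DONE ? s : s.substring(0, end) + "...";
    }

    public static void main(String[] args) {
        // The Bengali example from above: 7 code points, fewer user-perceived characters.
        System.out.println(truncate("ফোল্ডার", 3));
    }
}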

Languages using Indic and Arabic scripts (Kannada, Bengali, Thai, Burmese, Lao, Malayalam, Hindi, Persian, Arabic, Tamil...) make very heavy use of ZWJ and ZWNJ to modify characters. In those languages characters also combine with each other or modify each other even when there's no ZWJ, as in the previous example string. Some other examples: நி (U+0BA8 U+0BBF), षि (U+0937 U+093F). If you delete a code point in the middle or take a substring, it might not work as expected. Many languages like Burmese, Mongolian, CJKV..., as well as mathematical symbols and emojis, also utilize Variation Selectors (VS) to adjust the preceding character. For example က︀ (U+1002 U+FE00), ဂ︀ (U+1000 U+FE00), င︀ (U+1004 U+FE00), ⋚︀ (U+22DA U+FE00), 丸︀ (U+4E38 U+FE00). Here's the complete list of the alternate variants. Removing the VS will change the rendering of the document, which may affect the meaning or readability. You can't just arbitrarily take a substring in an internationalized application.

You can check out An Introduction to Writing Systems & Unicode - Complex script rendering and Complex text layout if you're interested in more information about those scripts

Some people have mentioned the use of combining character sequences like g̈ (U+0067 U+0308), Å (U+0041 U+030A) or é (U+0065 U+0301), but that's just a small, not-so-common case where a character is represented by multiple code points, and such sequences are usually convertible to a precomposed character (see the normalization sketch after the boundary rules below). In many other languages such combining sequences are far more common and essential to the rendering of the text. I'm going to give some examples below (in brackets []) along with some rules stated in Unicode Text Segmentation:

  • Do not break Hangul syllable sequences.
    • [ In Korean characters can be composed from Jamos: 훯 (U+D6E0 U+11B6), 가 (U+1100 U+1161), 각 (U+1100 U+1161 U+11A8), 까ᇫ (U+1101 U+1161 U+11EB). No one would consider those multiple characters apart from some weird standard ]
  • Do not break before extending characters or ZWJ.
    • [ For example Indic characters like ണ്‍ (U+0D23 U+0D4D U+200D), ല്‍ (U+0D32 U+0D4D U+200D), ര്‍ (U+0D30 U+0D4D U+200D), क्‍ (U+0915 U+094D U+200D) ]
  • Do not break within emoji modifier sequences or emoji zwj sequences.
    • [ ‍♀️ (U+1F3C3 U+1F3FB U+200D U+2640 U+FE0F), ‍♀️ (U+1F3C3 U+1F3FF U+200D U+2640 U+FE0F), ‍‍‍ (U+1F469 U+200D U+1F469 U+200D U+1F466 U+200D U+1F466), ‍‍‍ (U+1F468 U+200D U+1F469 U+200D U+1F466 U+200D U+1F466), ‍️ (it's a single super-wide emoji, not two, combining from U+1F636 U+200D U+1F32B U+FE0F), ‍❤️‍ (U+1F468 U+200D U+2764 U+FE0F U+200D U+1F468), ‍❤️‍ (U+1F469 U+1F3FC U+200D U+2764 U+FE0F U+200D U+1F468 U+1F3FD), ‍❤️‍‍ (U+1F469 U+1F3FB U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F469 U+1F3FF), ‍ (U+1F431 U+200D U+1F680), ‍ (U+1F431 U+200D U+1F464), ‍ (U+1F431 U+200D U+1F409), ‍ (U+1F431 U+200D U+1F4BB), ‍ (U+1F431 U+200D U+1F453), ‍ (U+1F431 U+200D U+1F3CD), (U+1F467 U+1F3FB), (U+1F935 U+1F3FB), ❤️ (U+2764 U+FE0F), 1️⃣ (U+0031 U+FE0F U+20E3), ⚕️ (U+2695 U+FE0F), ©️ (U+00A9 U+FE0F), ®️ (U+00AE U+FE0F), ‼️ (U+203C U+FE0F), ™️ (U+2122 U+FE0F), ☑︎ (U+2611 U+FE0E), ‍☠ (U+1F3F4 U+200D U+2620 U+FE0F), ️‍⚧ (U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F), ️‍ (U+1F3F3 U+FE0F U+200D U+1F308) ]. Note: Some of the above emojis may not display properly on your system, because they're platform-specific
  • Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
    • [ State/region flags are typically created by 2 regional indicator symbols, for example (U+1F1FB U+1F1F3), (U+1F1FA U+1F1F8), (U+1F1EC U+1F1E7), (U+1F1EF U+1F1F5), (U+1F1E9 U+1F1EA), (U+1F1EB U+1F1F7), (U+1F1EA U+1F1FA), (U+1F1FA U+1F1F3) ]. Note: You may just see the letters, most notably on Windows because MS somehow refused to add flag emojis to their platform

Grapheme Cluster Boundary Rules
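To illustrate the "usually convertible to a precomposed character" point above (a small sketch using java.text.Normalizer; note that the ZWJ emoji sequences in the list have no precomposed form at all, so normalization does not help there):

import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301";   // 'e' + combining acute accent
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(decomposed.length());          // 2
        System.out.println(composed.length());            // 1: é as the single code point U+00E9
        System.out.println(decomposed.equals(composed));  // false: different code points, same "character"
    }
}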


I'm afraid someone will have to explain to me why the world keeps using char when this seems obsolete.

As said, no one should iterate over chars in a string, regardless of whether a char is 1, 2 or 4 bytes long. The most correct way is to iterate over the graphemes instead; a good Unicode library like ICU will help you. This has also been mentioned in the Unicode document above:

As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. Grapheme clusters can be treated as units, by default, for processes such as the formatting of drop caps, as well as the implementation of text selection, arrow key movement or backspacing through text, and so forth. For example, when a grapheme cluster is represented internally by a character sequence consisting of base character + accents, then using the right arrow key would skip from the start of the base character to the end of the last accent.

Unicode Text Segmentation - Grapheme Cluster Boundaries
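With ICU4J on the classpath (the com.ibm.icu:icu4j artifact; a reasonably recent version treats a whole emoji ZWJ sequence as one grapheme), grapheme iteration is a minimal sketch like this:

import com.ibm.icu.text.BreakIterator;

public class GraphemeIteration {
    public static void main(String[] args) {
        // g̈, a, 👨‍🌾 (man farmer), ক  -- 9 chars, 7 code points, 4 user-perceived characters
        String s = "g\u0308a\uD83D\uDC68\u200D\uD83C\uDF3E\u0995";

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);

        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            System.out.println("grapheme: " + s.substring(start, end));
        }
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points");
    }
}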

Unfortunately in many cases that's not possible due to the lack of a proper Unicode library, and in that case one can iterate over the code points instead, but then one needs to be careful to avoid matching or cutting the string in the middle of a grapheme and breaking the user-perceived character.

In fact many modern languages prevent you from iterating bytes in a string by having a type typically called "rune" instead, which is effectively UTF-32 under the hood, and by avoiding the classic char completely or keeping it only as a legacy type. For example, in Go we have rune and in C# there's System.Text.Rune. In Rust the strings are stored as UTF-8, but iteration is done over char (which represents a Unicode scalar value rather than a byte):

for c in "नमस्ते".chars() {
    println!("{}", c);
}

Loops in Python 3 work in a similar way:

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

As can be seen, you never loop through the raw bytes. String iteration is done on runes instead, so the underlying string encoding is completely irrelevant: an implementation can use UTF-8, UTF-16, UTF-32 or any other Unicode encoding, and the user doesn't need to know anything about that, because they only interact with runes.