Various programming languages use a 2-byte `char` datatype (not to be confused with C/C++'s `char`, which is just one byte) out of which strings are constructed. Various utility functions will try to find such a `char` in a string, like looking for an `e` in `hello`, or do other operations that accept or return `char`s (split, indexOf, replace, count the occurrences of a character in a string, length, ...).
If you dig deeper you will find out about Unicode code points. And indeed, Java (and I assume other languages as well) lets you iterate those code points. But those seem to be represented by an `int` (4 bytes), not a `char` (2 bytes). Very rarely, if ever, will you see people using code points to iterate through a string. Since such a code point may span multiple `char`s (max 2, right, hence the `int`?) it's not the fastest way to do string operations, but it does seem to be the correct way.
Some programs/frameworks/operating systems(?) will also fail to work correctly with multi-`char` characters, instead deleting only the second `char` and leaving a "corrupted" character behind.
Shouldn't you always use the methods that operate on code points when dealing with strings? What am I missing? I'm afraid someone will have to explain to me why the world keeps using `char` when this seems obsolete. Is the size of a `char` sufficient after all? I know there are additional "helper" characters for "upgrading" other characters (turning an o into an ö and so on). How are these handled by `char` and code point iteration? Isn't there a chance to horribly corrupt your string if you replace `char`s instead of "whole" code points?
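As a concrete example of the kind of corruption I mean (a minimal Java sketch, class name hypothetical): deleting a single `char` out of a surrogate pair leaves an unpaired surrogate behind, while deleting a whole code point keeps the string well-formed.

```java
public class CorruptionDemo {
    public static void main(String[] args) {
        // U+1F600 is one code point but two chars (a surrogate pair).
        String s = "x\uD83D\uDE00y";                                      // 'x', U+1F600, 'y'

        // A char-level "delete the char at index 2" removes only the low
        // surrogate and leaves an unpaired high surrogate behind:
        String broken = s.substring(0, 2) + s.substring(3);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true: corrupted

        // Deleting the whole code point keeps the string well-formed:
        int len = Character.charCount(s.codePointAt(1));                  // 2 for U+1F600
        String ok = new StringBuilder(s).delete(1, 1 + len).toString();
        System.out.println(ok);                                           // "xy"
    }
}
```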
Note: this is a Western-world point of view; in parallel there was the history and evolution of Asian languages, which I skip. In any case, most character sets eventually converged on Unicode.
Historically we had ASCII. In reality there were also other character encodings, some of which did not even distinguish lower and upper case, but ASCII became the de facto standard (on Western computers, which use Latin scripts). ASCII was not sufficient, so there were extensions: "code pages". Every character was still 8-bit, but one could select which character set to use (and so which language to support).
All commonly used modern operating systems were born in that epoch, so programs, filesystems, APIs, text files, etc. started with that convention.
But the Internet and file exchange became more and more common, so a file produced in Stockholm was not completely readable in Germany or in the US. ISO standardized some code pages (e.g. Latin-1), which shared ASCII plus some common characters, while other parts differed according to the encoding. Windows used Latin-1 and filled the non-allocated space (you will see it described as "ANSI"). Asian scripts also became important (computers got better, so we could afford many more characters for everyday use, not just for typesetting).
So Unicode (and ISO) started to develop a new standard: one set for every character, compatible with all the most common charsets (so you could transform into Unicode and back without losing information, which really helped to make the transition smooth). This new charset was supposed to have 16-bit code points [WARNING: this is no longer true, but it was so in the first Unicode versions]. For this reason we have many combining characters, the "Han unification" (merging Chinese, Japanese, and old Korean characters into one set), and the special case of encoding the new Korean characters.
New languages adopted that version, hence 16-bit Unicode characters.
Some operating systems added new APIs with these 16-bit characters (Microsoft Windows did so together with long filenames on the filesystem, in a compatible way, so old computers could still read the files [just with short names and 8-bit characters]). In this manner you had compatibility with old programs, but new programs could (they were not forced to) use the new Unicode.
Old languages and Unix waited, struggling with how to get compatibility AND the new Unicode.
This seems to be your world (as in your question), so the early 1990s.
Guess what? 16 bits were not sufficient. So a newer (now already old) Unicode added planes and surrogates. Surrogates were a trick to keep the already allocated 16-bit Unicode characters valid while allowing (by using surrogate pairs) code points up to 0x10FFFF. This was also a difference from ISO, which allowed 31-bit code points.
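As a rough sketch of how the surrogate trick works (plain Java, using only standard `Character` methods; the class name is just for illustration): a code point above U+FFFF is shifted down by 0x10000 and its remaining 20 bits are split between a high and a low surrogate.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x1F600;                                  // a code point above U+FFFF

        // Subtract 0x10000 and split the remaining 20 bits into two halves.
        int offset = cp - 0x10000;
        char high = (char) (0xD800 + (offset >>> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits
        System.out.printf("%04X %04X%n", (int) high, (int) low);        // D83D DE00

        // The standard library does the same computation:
        char[] pair = Character.toChars(cp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00

        // 10 + 10 bits above the BMP is exactly why the ceiling is 0x10FFFF.
        System.out.printf("%X%n", Character.MAX_CODE_POINT);            // 10FFFF
    }
}
```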
At the same time, UTF-8 emerged, staying compatible with ASCII (and with the end-of-string `\0` used by many libraries/operating systems) while allowing all the new Unicode characters.

Some more modern languages started to implement UTF-32 (so using 32-bit Unicode), some old ones adapted (e.g. with new APIs), and some just kept the surrogates, changing "code point" into "code unit". Python is one of the exceptions: an old language that converted to full Unicode (and now, internally, it selects the optimal size: 8-bit, 16-bit or 32-bit), but the Python 3 conversion was very painful (and incompatible with old code; after 10 years many libraries were still not ready), so I think other old languages will think twice before trying to "upgrade".
The problem with your "problem" is that to get 16-bit (or 32-bit) characters you need a flag day: everybody should update, every program and every operating system, on the same day. So you would have to check all old code and adapt it, or have two sets of libraries and practically every operating system split in two: using the old characters, or using the new ones.
Personally, I think the Unix way is the best one: use UTF-8, keep ASCII compatibility, and extend. Old programs can process Unicode characters (transparently), even if they were built before the Unicode epoch (for printing, storage, transmission, etc.; obviously, to get the semantics of the characters they need to be Unicode aware).
Because of code units (two 16-bit code units are sometimes required for one Unicode code point), combining characters (do not assume one glyph is described by just one code point), variation selectors, emoji variants/tags, etc., it does not make much sense to iterate over and modify single characters. And we should not forget that fonts may build one glyph from various "characters".
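To see why "one character" is a slippery notion, here is a small sketch using Java's `BreakIterator` (the closest thing to grapheme clusters in the standard library; class name hypothetical): two user-perceived characters, but four code points and four `char`s.

```java
import java.text.BreakIterator;

public class GraphemeDemo {
    public static void main(String[] args) {
        // Two user-perceived characters, each a base letter plus a combining mark.
        String s = "e\u0301o\u0308";                       // é (e + U+0301), ö (o + U+0308)

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphemes = 0;
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            graphemes++;
        }

        System.out.println("chars:       " + s.length());                       // 4
        System.out.println("code points: " + s.codePointCount(0, s.length()));  // 4
        System.out.println("graphemes:   " + graphemes);                        // 2
    }
}
```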
So it is too difficult to move to UTF-32 globally, for all languages, because of existing programs and infrastructure. Now that UTF-8 seems to be dominant, I think we should keep UTF-8: people will use Unicode libraries, or just handle byte sequences transparently (maybe just merging, templating, etc.), or maybe do a simple search (which works for ASCII; otherwise the Unicode strings must be normalized).
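For the "must be normalized" point, a minimal sketch with `java.text.Normalizer` (class name hypothetical): the same word spelled with a precomposed and a decomposed "ü" only matches after both sides are normalized to the same form.

```java
import java.text.Normalizer;

public class NormalizedSearch {
    public static void main(String[] args) {
        String text  = "Zu\u0308rich";                     // "ü" as u + COMBINING DIAERESIS
        String query = "Z\u00FCrich";                      // "ü" as one precomposed code point

        System.out.println(text.contains(query));          // false: the code points differ

        // Normalize both sides to the same form (NFC here) before searching.
        String nText  = Normalizer.normalize(text,  Normalizer.Form.NFC);
        String nQuery = Normalizer.normalize(query, Normalizer.Form.NFC);
        System.out.println(nText.contains(nQuery));        // true
    }
}
```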