Are there languages that require three or more bytes per character when encoded using UTF-8? Which ones?


Commonly used ones, of course; Klingon doesn't count :-)

Thanks, guys, let me run my willItFit() test cases.

OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.


There are 4 answers

0
Michael Madsen On BEST ANSWER

Every character from U+0800 onward requires at least 3 bytes (3 bytes up to U+FFFF, 4 bytes from U+10000), so that's a HUGE number of characters. This includes East Asian scripts such as Japanese, Chinese, and Korean, as well as Thai.

For a complete list of script ranges, you can refer to Unicode's block data. Only the following blocks can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes:

0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
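
The boundaries behind that list can be sketched in a few lines of Python. This is only an illustration: `utf8_length` is a hypothetical helper name, not something from the answer, and it ignores the surrogate range U+D800..U+DFFF, which cannot be encoded on its own.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single code point."""
    if code_point < 0x80:       # 0000..007F: Basic Latin (ASCII)
        return 1
    if code_point < 0x800:      # 0080..07FF: the blocks listed above
        return 2
    if code_point < 0x10000:    # 0800..FFFF: rest of the Basic Multilingual Plane
        return 3
    return 4                    # 10000..10FFFF: supplementary planes


# Cross-check the helper against Python's own encoder at each boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

The last code point of NKo (U+07FF) is the last one that fits in 2 bytes; U+0800 is the first that needs 3.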
0
David Moye On

There are representations of many Asian languages that use more than 2 bytes. While it's true that they probably don't specifically need to, Japanese and Korean (at least) are often represented in multi-byte form.

0
Andrey On

Here we go:

So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.

More details:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes , Basic Multilingual Plane, code points from 0x0800.

Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So, all East Asian scripts and some others.
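
A quick way to check this for yourself is to encode one character per script and count the bytes. The sketch below assumes Python 3; the sample characters are arbitrary picks, and `samples` and `byte_counts` are just illustrative names.

```python
# One sample character per script, written as escapes so the file encoding
# does not matter. The choice of characters is arbitrary.
samples = {
    "Latin": "\u0041",          # A
    "Cyrillic": "\u042F",       # Я
    "Hebrew": "\u05D0",         # א
    "Devanagari": "\u0915",     # क (an Indic script)
    "Thai": "\u0E01",           # ก
    "Tagalog": "\u1700",        # ᜀ (a Philippine script)
    "Hiragana": "\u3042",       # あ
    "Katakana": "\u30A2",       # ア
    "CJK ideograph": "\u4E2D",  # 中
    "Hangul": "\uD55C",         # 한
}

# Number of bytes each sample occupies once encoded as UTF-8.
byte_counts = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, count in byte_counts.items():
    print(f"{name:14} U+{ord(samples[name]):04X}  {count} byte(s)")
```

Only the first three samples (Latin, Cyrillic, Hebrew) come in under 3 bytes; everything from Devanagari down needs 3.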

1
Jörg W Mittag On

You even need three bytes just for English. For example, the typographically correct apostrophe is encoded in UTF-8 as 0xE2 0x80 0x99, opening quote marks are 0xE2 0x80 0x9C and closing quote marks are 0xE2 0x80 0x9D. The ellipsis is 0xE2 0x80 0xA6. And that's not even talking about all the different dashes, spaces or the inch and feet signs.

“It’s kinda hard to write English without the apostrophe’s help …”
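
The byte sequences above are easy to verify. A minimal sketch in Python 3 follows; `hex_bytes` is a hypothetical helper name, and the sentence is the one quoted above, written with escapes.

```python
def hex_bytes(text: str) -> str:
    """Format the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


punctuation = {
    "apostrophe": "\u2019",     # right single quotation mark
    "opening quote": "\u201C",  # left double quotation mark
    "closing quote": "\u201D",  # right double quotation mark
    "ellipsis": "\u2026",       # horizontal ellipsis
}

for name, ch in punctuation.items():
    print(f"{name:14} {hex_bytes(ch)}")

# The example sentence contains five such characters, each 3 bytes long,
# so it takes 10 bytes more than its length in characters.
sentence = ("\u201CIt\u2019s kinda hard to write English "
            "without the apostrophe\u2019s help \u2026\u201D")
extra_bytes = len(sentence.encode("utf-8")) - len(sentence)
assert extra_bytes == 10
```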