How can a surrogate code point (0xD800-0xDFFF) in C++ cause program errors?


The following is a description from cppreference:


Range of universal character names

...

If a universal character name corresponds to a surrogate code point (the range 0xD800-0xDFFF, inclusive), the program is ill-formed.

I had a crash caused by emoji characters, but I can't reproduce it. I want to know whether it is because of this 0xD800-0xDFFF range problem.

Why is there a compile error at a:, but no runtime error at b:?

My example:

// ""  \uD83D\uDC76
// a:
// std::cout << "\uD83D\uDC76" << std::endl; // compile error

unsigned char e1 = 0xD8;
unsigned char e2 = 0x3D;
unsigned char e3 = 0xDC;
unsigned char e4 = 0x76;

unsigned char e[] = {e1, e2, e3, e4, '\0'};
// b:
std::cout << "e:" << e << std::endl; // Print �=�v, but there are no runtime error.

1 Answer

Answer by Giacomo Catenazzi (accepted):

You get different behaviour because the two cases are semantically different.

First: the code points in the interval you cited are called "surrogates". They are used to encode code points above the old 16-bit limit of the old UCS-2 encoding (2 bytes per character): a pair of surrogates (in the correct order) can describe any Unicode code point above U+FFFF. As a consequence, the surrogate code points themselves cannot be encoded as characters: a decoder would interpret them as halves of a two-code-unit sequence.
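As an illustration, here is a minimal sketch of the surrogate-pair arithmetic, using the pair 0xD83D/0xDC76 from the question (which encodes U+1F476):

#include <cstdint>
#include <iostream>

int main() {
    const std::uint16_t high = 0xD83D; // high (leading) surrogate
    const std::uint16_t low  = 0xDC76; // low (trailing) surrogate

    // Each surrogate carries 10 bits of the code point, offset by 0x10000.
    const std::uint32_t code_point =
        0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);

    std::cout << std::hex << "U+" << code_point << std::endl; // U+1f476
}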

Now, for your case. In the first case (a:) you hit exactly this problem: you are naming surrogate code points, which are not valid characters, and (assuming the source encoding is UTF-8) the compiler could not even know what you mean: should it merge the two surrogates and emit the UTF-8 encoding of the resulting code point, or emit two (invalid, but representable) three-byte sequences, one per surrogate? (The latter is called the CESU-8 encoding.) The standard avoids the ambiguity by making a universal character name that names a surrogate ill-formed, hence the compile error.
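If the goal is just to print the emoji, a minimal sketch of the usual fix is to name the code point itself rather than its surrogates (this assumes a compiler whose execution character set is UTF-8, e.g. recent GCC/Clang defaults, or MSVC with /utf-8):

#include <iostream>

int main() {
    // std::cout << "\uD83D\uDC76" << std::endl; // ill-formed: UCN names a surrogate
    std::cout << "\U0001F476" << std::endl;      // OK: one UCN for the whole code point
}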

The second case is different: you create an array of bytes, and the compiler does not try to interpret it as text, just as a sequence of bytes; you could put any binary data there, so it is valid and nothing can fail at runtime. The output, however, is effectively UTF-16BE (or UCS-2) data: whether it displays correctly depends entirely on how the terminal interprets those bytes, and a UTF-8 terminal shows replacement characters instead of the emoji. Note also that by writing the bytes out explicitly you forced a precise byte order; a consumer expecting the other endianness (or a different encoding) will see a different, possibly non-surrogate, sequence of code units.
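A small sketch of the difference, assuming a UTF-8 terminal (the UTF-8 byte sequence for U+1F476 is F0 9F 91 B6):

#include <iostream>

int main() {
    // The bytes from the question: UTF-16BE code units of the surrogate pair.
    unsigned char utf16be[] = {0xD8, 0x3D, 0xDC, 0x76, '\0'};
    std::cout << "utf16be: " << utf16be << std::endl; // mojibake on a UTF-8 terminal

    // The same character as UTF-8 bytes: a UTF-8 terminal shows the emoji.
    unsigned char utf8[] = {0xF0, 0x9F, 0x91, 0xB6, '\0'};
    std::cout << "utf8:    " << utf8 << std::endl;
}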

Note: some operating systems and languages adopted UCS-2 early, so you may find documentation that uses surrogates explicitly. See e.g. JavaScript, where some functions work on UCS-2 "code units" and others on UTF-16 "code points"; as a result you may find emoji described as code units (2-byte units) rather than code points (which take one or two code units, i.e. 2 or 4 bytes, depending on whether surrogates are needed).
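The same code-unit/code-point distinction can be seen in C++; a minimal sketch, using the code point from the question:

#include <iostream>
#include <string>

int main() {
    std::u16string s16 = u"\U0001F476"; // stored as UTF-16
    std::u32string s32 = U"\U0001F476"; // stored as UTF-32

    std::cout << s16.size() << " UTF-16 code units (a surrogate pair)" << std::endl; // 2
    std::cout << s32.size() << " code point" << std::endl;                           // 1
}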

Also note: in parallel to UTF-8 there is CESU-8, which is in practice UTF-8 applied at the UCS-2 level: each half of a surrogate pair is encoded separately as its own three-byte sequence.
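For completeness, a sketch of how the two encodings differ for U+1F476 (the CESU-8 bytes are the two 3-byte sequences of the surrogates 0xD83D and 0xDC76; UTF-8 uses a single 4-byte sequence):

#include <cstdio>

int main() {
    const unsigned char utf8[]  = {0xF0, 0x9F, 0x91, 0xB6};             // UTF-8 of U+1F476
    const unsigned char cesu8[] = {0xED, 0xA0, 0xBD, 0xED, 0xB1, 0xB6}; // CESU-8: 0xD83D, 0xDC76

    std::printf("UTF-8 : ");
    for (unsigned char b : utf8)  std::printf("%02X ", static_cast<unsigned>(b));
    std::printf("\nCESU-8: ");
    for (unsigned char b : cesu8) std::printf("%02X ", static_cast<unsigned>(b));
    std::printf("\n");
}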