Assigning UTF8 char literal to char16_t - too many chars in char constant


I'm creating a UTF8 table lookup for an embedded system. The table is used to convert a UTF8 encoded character to the bitmap index in a font (array).

I'm getting a warning "multicharacter character literal (potential portability problem)". Every entry in the "conversion_table" array is tagged with this warning.

Here's the code:

#include <stddef.h>   /* size_t, wchar_t */
#include <stdint.h>   /* uint8_t */
#include <uchar.h>    /* char16_t */

typedef struct UTF8_To_Bitmap_Index_s
{
    char16_t    encoded_character;
    uint8_t     bitmap_index;
} UTF8_To_Bitmap_Index_t;

size_t width_wchar_t = sizeof(wchar_t);

UTF8_To_Bitmap_Index_t conversion_table[] =
{
    {'¡', 0x00},
    {'À', 0x00},
    {'Á', 0x00},
    {'Ã', 0x00},
    {'Ä', 0x00},
    {'Å', 0x00},
    {'Ç', 0x00},
    {'É', 0x00},
    {'Í', 0x00},
    {'Ó', 0x00},
    {'Õ', 0x00},
    {'Ö', 0x00},
    {'Ø', 0x00},
    {'Ú', 0x00},
    {'Ü', 0x00},
    {'ß', 0x00},
    {'à', 0x00},
    {'á', 0x00},
    {'â', 0x00},
    {'ã', 0x00},
    {'ä', 0x00},
    {'å', 0x00},
    {'æ', 0x00},
    {'ç', 0x00},
    {'è', 0x00},
    {'é', 0x00},
    {'ê', 0x00},
    {'í', 0x00},
    {'ñ', 0x00},
    {'ó', 0x00},
    {'ô', 0x00},
};

Is there any method to change the above code to eliminate the warning?
(Note: the 0x00 is a placeholder until the actual bitmap index is determined.)

The data generated is correct:

     50          UTF8_To_Bitmap_Index_t conversion_table[] =
   \                     conversion_table:
   \   00000000   0xC2A1             DC16 49825
   \   00000002   0x00 0x00          DC8 0, 0
   \   00000004   0xC380             DC16 50048
   \   00000006   0x00 0x00          DC8 0, 0
   \   00000008   0xC381             DC16 50049
   \   0000000A   0x00 0x00          DC8 0, 0
   \   0000000C   0xC383             DC16 50051
   \   0000000E   0x00 0x00          DC8 0, 0
   \   00000010   0xC384             DC16 50052
   \   00000012   0x00 0x00          DC8 0, 0
   \   00000014   0xC385             DC16 50053
   \   00000016   0x00 0x00          DC8 0, 0
   \   00000018   0xC387             DC16 50055
   \   0000001A   0x00 0x00          DC8 0, 0
   \   0000001C   0xC389             DC16 50057
   \   0000001E   0x00 0x00          DC8 0, 0
   \   00000020   0xC38D             DC16 50061
   \   00000022   0x00 0x00          DC8 0, 0
   \   00000024   0xC393             DC16 50067
   \   00000026   0x00 0x00          DC8 0, 0
   \   00000028   0xC395             DC16 50069
   \   0000002A   0x00 0x00          DC8 0, 0
   \   0000002C   0xC396             DC16 50070
   \   0000002E   0x00 0x00          DC8 0, 0
   \   00000030   0xC398             DC16 50072
   \   00000032   0x00 0x00          DC8 0, 0
   \   00000034   0xC39A             DC16 50074
   \   00000036   0x00 0x00          DC8 0, 0
   \   00000038   0xC39C             DC16 50076
   \   0000003A   0x00 0x00          DC8 0, 0
   \   0000003C   0xC39F             DC16 50079
   \   0000003E   0x00 0x00          DC8 0, 0
   \   00000040   0xC3A0             DC16 50080
   \   00000042   0x00 0x00          DC8 0, 0
   \   00000044   0xC3A1             DC16 50081
   \   00000046   0x00 0x00          DC8 0, 0
   \   00000048   0xC3A2             DC16 50082
   \   0000004A   0x00 0x00          DC8 0, 0
   \   0000004C   0xC3A3             DC16 50083
   \   0000004E   0x00 0x00          DC8 0, 0
   \   00000050   0xC3A4             DC16 50084
   \   00000052   0x00 0x00          DC8 0, 0
   \   00000054   0xC3A5             DC16 50085
   \   00000056   0x00 0x00          DC8 0, 0
   \   00000058   0xC3A6             DC16 50086
   \   0000005A   0x00 0x00          DC8 0, 0
   \   0000005C   0xC3A7             DC16 50087
   \   0000005E   0x00 0x00          DC8 0, 0
   \   00000060   0xC3A8             DC16 50088
   \   00000062   0x00 0x00          DC8 0, 0
   \   00000064   0xC3A9             DC16 50089
   \   00000066   0x00 0x00          DC8 0, 0
   \   00000068   0xC3AA             DC16 50090
   \   0000006A   0x00 0x00          DC8 0, 0
   \   0000006C   0xC3AD             DC16 50093
   \   0000006E   0x00 0x00          DC8 0, 0
   \   00000070   0xC3B1             DC16 50097
   \   00000072   0x00 0x00          DC8 0, 0
   \   00000074   0xC3B3             DC16 50099
   \   00000076   0x00 0x00          DC8 0, 0
   \   00000078   0xC3B4             DC16 50100
   \   0000007A   0x00 0x00          DC8 0, 0

Resources:
Compiler: IAR Embedded Workbench version 7.4
Target platform: ARM Cortex M


There are 2 answers

rici (accepted answer)

It's basically incorrect to try to store a UTF-8 encoded byte sequence in a char16_t, even if it would fit (and there's no guarantee of that in general, since UTF-8 code sequences can be from one to four bytes long). The intended purpose of char16_t is to store a single UTF-16 code value (which is not necessarily an entire character, but that's another story). [Note 1]

Of course, 16 bits is 16 bits, so you can mash two octets into a char16_t if you really want to. But don't expect the compiler to accept that without warnings.

If you absolutely know that the UTF-8 sequence is two bytes long, then you should store it in a char[2]. You can type-pun char[2] with char16_t if you want to be able to refer to the two characters as a scalar, but the strict aliasing rule is likely to get in your way. In addition, you'll need to think through the endianness issue which you are currently just gliding over.
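
For illustration, here is a minimal sketch of that char[2] approach; the type and field names are placeholders, and the comparison uses memcmp rather than type-punning, so strict aliasing never comes into play:

#include <stdint.h>
#include <string.h>

/* Hypothetical variant of the table entry that stores the two UTF-8
   bytes directly, in the order they arrive on the wire. */
typedef struct
{
    uint8_t utf8_bytes[2];   /* two-byte UTF-8 sequence */
    uint8_t bitmap_index;
} UTF8_Pair_To_Bitmap_Index_t;

static const UTF8_Pair_To_Bitmap_Index_t pair_table[] =
{
    { { 0xC2, 0xA1 }, 0x00 },   /* ¡ */
    { { 0xC3, 0x9F }, 0x00 },   /* ß */
};

/* memcmp avoids punning the byte pair through a char16_t. */
static int pair_matches(const UTF8_Pair_To_Bitmap_Index_t *entry,
                        const uint8_t incoming[2])
{
    return memcmp(entry->utf8_bytes, incoming, 2) == 0;
}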

When you receive a UTF-8 encoded sequence from a serial port (or a UTF-8 encoded file or socket, or whatever), you'll receive the first byte first, as stands to reason. If you map two of these bytes onto a two-byte integer, the low-address byte of the integer will contain the first byte, and the high-address byte of the integer will contain the second byte. That's perfect if you use a big-endian architecture, where the high-order byte has the low address. Maybe you're working in a big-endian environment. But if not, you're likely to find that your input doesn't match the constant you've created.
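
If the key really must be a 16-bit scalar, the safest move is to build it from the received bytes explicitly rather than reinterpreting memory. A sketch, where serial_read_byte() stands in for whatever the real driver provides:

#include <stdint.h>

extern uint8_t serial_read_byte(void);   /* placeholder for the real driver */

/* Put the first wire byte in the high-order half so the result matches
   the 0xC2A1-style table constants on any endianness. */
static uint16_t read_two_byte_utf8_key(void)
{
    uint16_t first  = serial_read_byte();   /* e.g. 0xC2 */
    uint16_t second = serial_read_byte();   /* e.g. 0xA1 */
    return (uint16_t)((first << 8) | second);
}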

As indicated by the warning you're seeing, there is no standard way to convert a two-byte sequence into an integer (and remember that in C, a character literal is an int, not a char). So a given compiler might do anything, including limiting the character literal to a single byte, but it's common for compilers to encode the multiple characters as though they were a base-256 number. Consequently, 'AB' and '\x41\x42' both produce the integer 0x4142. But if you were to map that integer onto a char[4] on a little-endian machine, what you would see would be the byte sequence 0x42 0x41 0x00 0x00, which if you printed it to the console would appear as BA.
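
A small host-side demo of that effect (the exact value of 'AB' is implementation-defined, so the output can differ between compilers):

#include <stdio.h>
#include <string.h>

int main(void)
{
    int value = 'AB';                    /* commonly 0x4142, but not guaranteed */
    unsigned char bytes[sizeof value];

    memcpy(bytes, &value, sizeof value);

    printf("value = 0x%X\n", value);
    /* On a little-endian machine this prints 42 41 00 00: the "BA"
       reversal described above, followed by zero padding. */
    for (size_t i = 0; i < sizeof value; i++)
        printf("%02X ", bytes[i]);
    printf("\n");
    return 0;
}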

Depending on how you produce the two-byte key for the lookup table, that might or might not give you what you want. Regardless, it's not going to be portable (or even future-proof) because there is no standard mechanism for creating a 16-bit compile-time integer out of a two-byte UTF-8 encoding.
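
One way to sidestep the whole issue while keeping a 16-bit key is to spell out the two UTF-8 bytes yourself at compile time. This sketch assumes you want the same byte-pair layout as the 0xC2A1-style values in your listing; the macro name is made up for the example, and the struct repeats the layout from the question:

#include <stdint.h>
#include <uchar.h>   /* char16_t (C11) */

typedef struct UTF8_To_Bitmap_Index_s
{
    char16_t    encoded_character;
    uint8_t     bitmap_index;
} UTF8_To_Bitmap_Index_t;

/* Build the key from the two UTF-8 bytes explicitly; no multi-character
   literal is involved, so no portability warning. */
#define UTF8_2BYTE_KEY(b0, b1)  ((char16_t)(((unsigned)(b0) << 8) | (unsigned)(b1)))

UTF8_To_Bitmap_Index_t conversion_table[] =
{
    { UTF8_2BYTE_KEY(0xC2, 0xA1), 0x00 },   /* ¡ */
    { UTF8_2BYTE_KEY(0xC3, 0x9F), 0x00 },   /* ß */
};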

There's still one more piece to this puzzle, though. Your program appears to contain this:

    {'ß', 0x00},

But we know (even if we prefer to ignore the fact for simplicity) that there is no such thing as a character inside a computer. All that you'll find are 0s and 1s. If we were to be truly accurate, you wouldn't find those either, since there are no microscopic zeros breezing from electrode to electrode inside a serial bus; rather, there are subatomic phenomena which can be treated as though they take on two distinct states. But we don't need to descend to that level of physical description; it's sufficient to say that the file in which your program is held does not contain tiny characters but rather sequences of bits. And the question is, exactly what sequence of bits is there? In particular, which (and how many) bits are being shown as ß? The answer is defined by the character encoding of the file.

My guess is that you composed that source file using an editor working with a UTF-8 encoding, so that the ß shows up as the two byte sequence C3 9F. Now, what happens when the compiler sees those two bytes?

The C standard does not require any particular encoding, but it allows compilers to treat their input as a sequence of single-byte characters, each of which represents one of the characters in the basic source character set, which doesn't include ß. The compiler has complete latitude as to how it will treat any byte which does not correspond with a character in the source character set, and furthermore how those bytes are mapped onto characters and character strings in the executable (which is allowed to use a different encoding than the source file). This all gets a bit complicated; perhaps I'll add a full explanation later. Suffice it to say that many compilers just treat a byte as a byte, at least inside character and string literals; the byte is just passed through without regard to encoding. (Other compilers use a more sophisticated algorithm taking into account source and execution encodings, which might differ. But in the simple case, the results are identical.)

So that's why the compiler complains that 'ß' is more than one character: it is, since it is encoded as two bytes. (If you were using Latin-1 as both the source and execution character sets, then the ß would be just one byte, 0xDF, and the compiler wouldn't complain. But that wouldn't get you a UTF-8 conversion table.)

C11 (and contemporary C++ versions) privilege Unicode and the UTF-8 encoding, which is entirely appropriate. It gets around some of the chaos of multiple locales by providing a syntax which allows you to specify Unicode character codes unambiguously using the basic source character set, and by providing string and character literal prefixes which describe the desired encoding. If you have such a compiler, you could write the ß as \u00DF, which is its Unicode code point, and include it in a UTF-8 string literal by using the u8 prefix: u8"\u00DF". [Note 2]
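
A quick demonstration, assuming a C11-capable compiler:

#include <stdio.h>

int main(void)
{
    /* u8 requests UTF-8 execution encoding; \u00DF is the universal
       character name for ß, so the bytes below are well defined. */
    const char eszett[] = u8"\u00DF";

    /* Prints C3 9F 00: the two UTF-8 bytes plus the terminator. */
    for (size_t i = 0; i < sizeof eszett; i++)
        printf("%02X ", (unsigned char)eszett[i]);
    printf("\n");
    return 0;
}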

Notes

  1. Technically, char16_t is only identified with UTF-16 if the preprocessor macro __STDC_UTF_16__ is defined in uchar.h, and similarly for char32_t and __STDC_UTF_32__. But I still think it's fair to say that the intended use was Unicode encodings.

  2. If you wanted to use UTF-16 or UTF-32 encodings, you could make a char16_t[] string literal by writing u"\u00DF", or a char32_t[] string literal, U"\u00DF". Both of these would have two elements, including the NUL terminator. (One of those might be the same as the wide-character string literal, L"\u00DF", but that's dependent on the configured execution locale and compiler support.) You can also have char16_t and char32_t character literals. But note that u'\u00DF' has the value 0xDF, which is the Unicode codepoint for ß.
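
To illustrate Note 2 concretely (again assuming C11 support):

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    const char16_t utf16[] = u"\u00DF";   /* { 0x00DF, 0x0000 } */
    const char32_t utf32[] = U"\u00DF";   /* { 0x000000DF, 0 }  */
    char16_t       single  = u'\u00DF';   /* scalar value 0x00DF */

    printf("utf16 elements: %u\n", (unsigned)(sizeof utf16 / sizeof utf16[0]));
    printf("utf32 elements: %u\n", (unsigned)(sizeof utf32 / sizeof utf32[0]));
    printf("u'\\u00DF' = 0x%X\n", (unsigned)single);
    return 0;
}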

Daniel Kleinstein

The code as-is is non-portable as per the standard (§6.4.4.4.2 and §6.4.4.4.10):

An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'. A wide character constant is the same, except prefixed by the letter L, u, or U. ... The value of an integer character constant containing more than one character (e.g., 'ab'), […] is implementation-defined. ...


You are storing your characters as char16_t, so per the standard you should be using u'…' syntax rather than plain '…' syntax.


This should solve your problem:

UTF8_To_Bitmap_Index_t conversion_table[] =
{
    {u'¡', 0x00},
    {u'À', 0x00},
    ...