I'm creating a UTF-8 lookup table for an embedded system. The table is used to convert a UTF-8 encoded character to the bitmap index in a font (array).
I'm getting the warning "multicharacter character literal (potential portability problem)". Every entry in the conversion_table array is tagged with this warning.
Here's the code:
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>   /* char16_t */
#include <wchar.h>   /* wchar_t */

typedef struct UTF8_To_Bitmap_Index_s
{
    char16_t encoded_character;
    uint8_t  bitmap_index;
} UTF8_To_Bitmap_Index_t;

size_t width_wchar_t = sizeof(wchar_t);

/* Every initializer below is tagged with the multicharacter warning. */
UTF8_To_Bitmap_Index_t conversion_table[] =
{
    {'¡', 0x00},
    {'À', 0x00},
    {'Á', 0x00},
    {'Ã', 0x00},
    {'Ä', 0x00},
    {'Å', 0x00},
    {'Ç', 0x00},
    {'É', 0x00},
    {'Í', 0x00},
    {'Ó', 0x00},
    {'Õ', 0x00},
    {'Ö', 0x00},
    {'Ø', 0x00},
    {'Ú', 0x00},
    {'Ü', 0x00},
    {'ß', 0x00},
    {'à', 0x00},
    {'á', 0x00},
    {'â', 0x00},
    {'ã', 0x00},
    {'ä', 0x00},
    {'å', 0x00},
    {'æ', 0x00},
    {'ç', 0x00},
    {'è', 0x00},
    {'é', 0x00},
    {'ê', 0x00},
    {'í', 0x00},
    {'ñ', 0x00},
    {'ó', 0x00},
    {'ô', 0x00},
};
Is there any method to change the above code to eliminate the warning?
(Note: the 0x00 is a placeholder until the actual bitmap index is determined.)
The data generated is correct:
50 UTF8_To_Bitmap_Index_t conversion_table[] =
\ conversion_table:
\ 00000000 0xC2A1 DC16 49825
\ 00000002 0x00 0x00 DC8 0, 0
\ 00000004 0xC380 DC16 50048
\ 00000006 0x00 0x00 DC8 0, 0
\ 00000008 0xC381 DC16 50049
\ 0000000A 0x00 0x00 DC8 0, 0
\ 0000000C 0xC383 DC16 50051
\ 0000000E 0x00 0x00 DC8 0, 0
\ 00000010 0xC384 DC16 50052
\ 00000012 0x00 0x00 DC8 0, 0
\ 00000014 0xC385 DC16 50053
\ 00000016 0x00 0x00 DC8 0, 0
\ 00000018 0xC387 DC16 50055
\ 0000001A 0x00 0x00 DC8 0, 0
\ 0000001C 0xC389 DC16 50057
\ 0000001E 0x00 0x00 DC8 0, 0
\ 00000020 0xC38D DC16 50061
\ 00000022 0x00 0x00 DC8 0, 0
\ 00000024 0xC393 DC16 50067
\ 00000026 0x00 0x00 DC8 0, 0
\ 00000028 0xC395 DC16 50069
\ 0000002A 0x00 0x00 DC8 0, 0
\ 0000002C 0xC396 DC16 50070
\ 0000002E 0x00 0x00 DC8 0, 0
\ 00000030 0xC398 DC16 50072
\ 00000032 0x00 0x00 DC8 0, 0
\ 00000034 0xC39A DC16 50074
\ 00000036 0x00 0x00 DC8 0, 0
\ 00000038 0xC39C DC16 50076
\ 0000003A 0x00 0x00 DC8 0, 0
\ 0000003C 0xC39F DC16 50079
\ 0000003E 0x00 0x00 DC8 0, 0
\ 00000040 0xC3A0 DC16 50080
\ 00000042 0x00 0x00 DC8 0, 0
\ 00000044 0xC3A1 DC16 50081
\ 00000046 0x00 0x00 DC8 0, 0
\ 00000048 0xC3A2 DC16 50082
\ 0000004A 0x00 0x00 DC8 0, 0
\ 0000004C 0xC3A3 DC16 50083
\ 0000004E 0x00 0x00 DC8 0, 0
\ 00000050 0xC3A4 DC16 50084
\ 00000052 0x00 0x00 DC8 0, 0
\ 00000054 0xC3A5 DC16 50085
\ 00000056 0x00 0x00 DC8 0, 0
\ 00000058 0xC3A6 DC16 50086
\ 0000005A 0x00 0x00 DC8 0, 0
\ 0000005C 0xC3A7 DC16 50087
\ 0000005E 0x00 0x00 DC8 0, 0
\ 00000060 0xC3A8 DC16 50088
\ 00000062 0x00 0x00 DC8 0, 0
\ 00000064 0xC3A9 DC16 50089
\ 00000066 0x00 0x00 DC8 0, 0
\ 00000068 0xC3AA DC16 50090
\ 0000006A 0x00 0x00 DC8 0, 0
\ 0000006C 0xC3AD DC16 50093
\ 0000006E 0x00 0x00 DC8 0, 0
\ 00000070 0xC3B1 DC16 50097
\ 00000072 0x00 0x00 DC8 0, 0
\ 00000074 0xC3B3 DC16 50099
\ 00000076 0x00 0x00 DC8 0, 0
\ 00000078 0xC3B4 DC16 50100
\ 0000007A 0x00 0x00 DC8 0, 0
Resources:
Compiler -- IAR Embedded Workbench version 7.4
Target platform: ARM Cortex M
It's basically incorrect to try to store a UTF-8 encoded byte sequence in a char16_t, even if it would fit (and there's no guarantee of that in general, since UTF-8 code sequences can be from one to four bytes long). The intended purpose of char16_t is to store a single UTF-16 code value (which is not necessarily an entire character, but that's another story). [Note 1]

Of course, 16 bits is 16 bits, so you can mash two octets into a char16_t if you really want to. But don't expect the compiler to accept that without warnings.

If you absolutely know that the UTF-8 sequence is two bytes long, then you should store it in a char[2]. You can type-pun char[2] with char16_t if you want to be able to refer to the two characters as a scalar, but the strict aliasing rule is likely to get in your way. In addition, you'll need to think through the endianness issue, which you are currently just gliding over.

When you receive a UTF-8 encoded sequence from a serial port (or a UTF-8 encoded file or socket, or whatever), you'll receive the first byte first, as stands to reason. If you map two of these characters onto a two-byte integer, the low-address byte of the integer will contain the first byte, and the high-address byte of the integer will contain the second byte. That's perfect if you use a big-endian architecture, where the high-order byte has the low address. Maybe you're working in a big-endian environment. But if not (and ARM Cortex-M targets are almost always configured little-endian), you're likely to find that your input doesn't match the constant you've created.
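To make that concrete, here is a minimal sketch of the char[2] approach. The struct layout mirrors your table, but the names (UTF8_Key_To_Bitmap_Index_t, lookup_bitmap_index) are illustrative, not anything from your program, and the keys are spelled as raw byte strings so that no multicharacter literal is involved:

#include <stdint.h>
#include <string.h>

/* The key is the two raw UTF-8 bytes in transmission order, so neither
   endianness nor multicharacter literals come into play. */
typedef struct
{
    char    utf8[2];      /* two-byte UTF-8 sequence, e.g. "\xC3\x9F" for ß */
    uint8_t bitmap_index;
} UTF8_Key_To_Bitmap_Index_t;

static const UTF8_Key_To_Bitmap_Index_t table[] =
{
    { "\xC2\xA1", 0x00 }, /* ¡ */
    { "\xC3\x80", 0x00 }, /* À */
    { "\xC3\x9F", 0x00 }, /* ß */
    /* ... */
};

/* Linear search over the table; returns -1 when the two-byte
   sequence has no bitmap index assigned. */
static int lookup_bitmap_index(const char utf8[2])
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    {
        if (memcmp(table[i].utf8, utf8, 2) == 0)
        {
            return table[i].bitmap_index;
        }
    }
    return -1;
}

Storing the key in transmission order sidesteps the endianness question entirely: the comparison is byte by byte, in the same order the bytes arrive.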
As indicated by the warning you're seeing, there is no standard way to convert a two-byte sequence into an integer (and remember that in C, a character literal is an int, not a char). So a given compiler might do anything, including limiting the character literal to a single byte, but it's common for compilers to encode the multiple characters as though they were a base-256 number. Consequently, 'AB' and '\x41\x42' both produce the integer 0x4142. But if you were to map that integer onto a char[4] on a little-endian machine, what you would see would be the byte sequence 0x42 0x41 0x00 0x00, which if you printed it to the console would appear as BA.

Depending on how you produce the two-byte key for the lookup table, that might or might not give you what you want. Regardless, it's not going to be portable (or even future-proof), because there is no standard mechanism for creating a 16-bit compile-time integer out of a two-byte UTF-8 encoding.
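To see that concretely, here's a small demo. It assumes a compiler that uses the common base-256 encoding (nothing here is guaranteed by the standard), and compiling it will itself provoke the same multicharacter warning:

#include <stdio.h>
#include <string.h>

int main(void)
{
    int v = 'AB';                 /* implementation-defined; commonly 0x4142 */
    unsigned char bytes[sizeof v];
    memcpy(bytes, &v, sizeof v);  /* reveal the in-memory byte order */

    printf("'AB' = 0x%X\n", (unsigned)v);
    for (size_t i = 0; i < sizeof bytes; i++)
        printf("byte %zu: 0x%02X\n", i, bytes[i]);
    /* On a little-endian machine the first bytes printed are
       0x42 ('B') then 0x41 ('A'): the reverse of the source order. */
    return 0;
}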
There's still one more piece to this puzzle, though. Your program appears to contain this:

{'ß', 0x00},
But we know (even if we prefer to ignore the fact for simplicity) that there is no such thing as a character inside a computer. All that you'll find are 0s and 1s. If we were to be truly accurate, you wouldn't find those either, since there are no microscopic zeros breezing from electrode to electrode inside a serial bus; rather, there are subatomic phenomena which can be treated as though they fit two distinct states. But we don't need to descend to that level of physical description; it's sufficient to say that the file in which your program is held does not contain tiny characters but rather sequences of bits. And the question is, exactly what sequence of bits is there? In particular, which (and how many) bits are being shown as ß? The answer is defined by the character encoding of the file.

My guess is that you composed that source file using an editor working with a UTF-8 encoding, so that the ß shows up as the two-byte sequence C3 9F. Now, what happens when the compiler sees those two bytes?

The C standard does not require any particular encoding, but it allows compilers to treat their input as a sequence of single-byte characters, each of which represents one of the characters in the basic source character set, which doesn't include ß. The compiler has complete latitude as to how it will treat any byte which does not correspond with a character in the source character set, and furthermore how those bytes are mapped onto characters and character strings in the executable (which is allowed to use a different encoding than the source file). This all gets a bit complicated; perhaps I'll add a full explanation later. Suffice it to say that many compilers just treat a byte as a byte, at least inside character and string literals; the byte is just passed through without regard to encoding. (Other compilers use a more sophisticated algorithm taking into account source and execution encodings, which might differ. But in the simple case, the results are identical.)

So that's why the compiler complains that 'ß' is more than one character: it is, since it is encoded as two bytes. (If you were using Latin-1 as both the source and execution character sets, then the ß would be just one byte, 0xDF, and the compiler wouldn't complain. But that wouldn't get you a UTF-8 conversion table.)

C11 (and contemporary C++ versions) privilege Unicode and the UTF-8 transmission encoding, which is entirely appropriate. It gets around some of the chaos of multiple locales by providing a syntax which allows you to specify Unicode character codes unambiguously using the basic source character set, and by providing string and character literal prefixes which describe the desired encoding. If you have such a compiler, you could write the ß as \u00DF, which is its Unicode code point, and include it in a UTF-8 string literal by using the u8 prefix: u8"\u00DF". [Note 2]
Notes
1. Technically, char16_t is only identified with UTF-16 if the preprocessor macro __STDC_UTF_16__ is defined in uchar.h, and similarly for char32_t and __STDC_UTF_32__. But I still think it's fair to say that the intended use was Unicode encodings.

2. If you wanted to use UTF-16 or UTF-32 encodings, you could make a char16_t[] string literal by writing u"\u00DF", or a char32_t[] string literal, U"\u00DF". Both of these would have two elements, including the NUL terminator. (One of those might be the same as the wide-character string literal, L"\u00DF", but that's dependent on the configured execution locale and compiler support.) You can also have char16_t and char32_t character literals. But note that u'\u00DF' has the value 0xDF, which is the Unicode codepoint for ß.
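Here's a minimal demonstration of those literals, assuming a C11 compiler that provides uchar.h:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    const char16_t u16[] = u"\u00DF";  /* { 0x00DF, 0x0000 } */
    const char32_t u32[] = U"\u00DF";  /* { 0x000000DF, 0x00000000 } */

    printf("char16_t elements: %zu\n", sizeof u16 / sizeof u16[0]); /* 2 */
    printf("char32_t elements: %zu\n", sizeof u32 / sizeof u32[0]); /* 2 */
    printf("u'\\u00DF' value:   0x%X\n", (unsigned)u'\u00DF');      /* 0xDF */
    return 0;
}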