I am processing Unicode strings in C with libunistring, and I can't use another library. My goal is to read a single character from the Unicode string at its index position, print it, and compare it to a fixed value. This should be really simple, but well ...
Here's my try (complete C program):
/* This file must be UTF-8 encoded in order to work */
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>

int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}

int main() {
    setlocale(LC_ALL, ""); /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);
    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];
    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);
    return 0;
}
In order to run this program:
- Save the program to a UTF-8 encoded file called ustridx.c
- Install libunistring with
  sudo apt-get install libunistring-dev
- Compile and link it with (the library goes after the object file so the linker can resolve its symbols)
  gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx ustridx.o -lunistring
- Make sure the terminal is set to a UTF-8 locale (check with
  locale
  )
- Run it with
  ./ustridx
Output:
Current locale charset: UTF-8 (should be UTF-8)
foo 楽あり bébé
- char 0: f
- char 5: あ
- last : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.
The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.
'あ' and 'é' are not portable as plain character literals: only characters from the basic source character set and escape sequences are guaranteed to work in character literals. GCC accepts them but emits a warning (see godbolt) saying
warning: multi-character character constant
That warning is really meant for constants such as 'abc', which are multicharacter literals. It fires here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on its value being the corresponding Unicode code point. GCC specifically doesn't produce the code point.

Since C11 you can use UTF-32 character literals such as U'あ', which yields a char32_t value equal to the Unicode code point of the character. Although by my reading the standard doesn't guarantee that characters such as あ can be used in literals, the examples on cppreference seem to suggest that it is common for compilers to allow this.

A standard-compliant, portable solution is to use a Unicode escape sequence for the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.
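
As a rough illustration of the points above, here is a minimal sketch in standard C11 that does not depend on libunistring. The names is_code_point, pack_bytes and sample are hypothetical helpers invented for this example, not part of any library. pack_bytes only mimics the byte-packing that GCC documents for multi-character constants (shift the previous value left by 8 bits, then OR in the next byte); that this is what produced the '쎩' in the question's output is an assumption, though it is consistent with é being the bytes C3 A9 in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>   /* char32_t, available since C11 */

/* Hypothetical helper: does a UTF-32 unit equal the expected code point? */
int is_code_point(uint32_t actual, char32_t expected) {
    return actual == (uint32_t)expected;
}

/* Hypothetical helper that mimics GCC's documented handling of
 * multi-character constants: each byte is shifted in from the right.
 * Assumption: 8-bit bytes and at most 4 of them. */
uint32_t pack_bytes(const unsigned char *bytes, size_t n) {
    assert(n <= 4);  /* more bytes would not fit into 32 bits */
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++) {
        value = (value << 8) | bytes[i];
    }
    return value;
}

/* "foo 楽あり bébé" as UTF-32, spelled with universal character names so
 * the result does not depend on the encoding of the source file. */
const char32_t sample[] = U"foo \u697D\u3042\u308A b\u00E9b\u00E9";
```

With these definitions, is_code_point(sample[5], U'\u3042') holds, whereas comparing sample[5] against the packed UTF-8 bytes of あ (E3 81 82, giving 0xE38182) does not. In the original program, the equivalent change would be to pass U'\u3042' and U'\u00E9' (or U'あ' and U'é' where the compiler accepts them) to cmpchr instead of 'あ' and 'é'.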