How to get single characters from a Unicode string, then compare and print them?

I am processing Unicode strings in C with libunistring, and I can't use another library. My goal is to read a single character from the Unicode string at a given index, print it, and compare it to a fixed value. This should be really simple, but well ...

Here's my attempt (a complete C program):

/* This file must be UTF-8 encoded in order to work */

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>


int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}


int main() {
    setlocale(LC_ALL, "");     /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);

    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];

    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last  : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}

In order to run this program:

  1. Save the program to a UTF-8 encoded file called ustridx.c
  2. sudo apt-get install libunistring-dev
  3. gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
  4. Make sure the terminal is set to a UTF-8 locale (locale)
  5. Run it with ./ustridx

Output:

Current locale charset: UTF-8 (should be UTF-8)

foo 楽あり bébé
 - char 0: f
 - char 5: あ
 - last  : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.

The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.

There are 2 answers

IlCapitano (best answer)

'あ' and 'é' are not portable character literals. The only characters guaranteed to work in character literals are those from the basic source character set, plus escape sequences.

GCC accepts them, however, and emits a warning (see godbolt): warning: multi-character character constant. That warning normally concerns constants such as 'abc', which are multi-character literals. It appears here because あ and é are each encoded as multiple bytes in UTF-8, so GCC sees several bytes between the quotes. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on it being the corresponding Unicode code point. GCC specifically does not produce the code point, which is consistent with the output shown in the question.
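To illustrate where the wrong values come from, here is a minimal sketch that mimics the packing GCC documents for multi-character constants: the previous value is shifted left by one byte and the next byte is OR-ed in. The helper name pack_multichar and the two byte arrays are hypothetical, the byte values assume a UTF-8 encoded source file, and other compilers may pack the bytes differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs bytes the way GCC documents for
 * multi-character constants, i.e. shift the previous value left by
 * 8 bits and OR in the next byte. Other compilers may differ. */
static uint32_t pack_multichar(const unsigned char *bytes, size_t n) {
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++) {
        value = (value << 8) | bytes[i];
    }
    return value;
}

/* UTF-8 encodings of U+00E9 (e with acute) and U+3042 (hiragana a). */
static const unsigned char E_ACUTE_UTF8[] = { 0xC3, 0xA9 };
static const unsigned char HIRAGANA_A_UTF8[] = { 0xE3, 0x81, 0x82 };
```

Under these assumptions, pack_multichar(E_ACUTE_UTF8, 2) is 0xC3A9, which is the code point of 쎩 rather than 0xE9, and pack_multichar(HIRAGANA_A_UTF8, 3) is 0xE38182, which lies above 0x10FFFF and so is not a valid code point at all. That matches the two bad lines in the question's output.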

Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t value (declared in <uchar.h>) equal to the Unicode code point of the character. Although by my reading the standard doesn't allow using characters such as あ in literals, the examples on cppreference seem to suggest that it is common for compilers to allow this.
A standard-compliant, portable solution is to use Unicode escape sequences in the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.
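As a rough sketch of the portable variant, the fixed values can be written as UTF-32 literals with Unicode escapes and compared with plain ==. The helper names is_hiragana_a and is_e_acute are hypothetical. The parameters are uint32_t to match the element type used in the question's code, and a C11 compiler is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: compare a decoded code point against fixed
 * values written as UTF-32 literals with Unicode escapes, so the
 * result does not depend on the encoding of the source file. */
static bool is_hiragana_a(uint32_t c) {
    return c == U'\u3042';   /* U+3042, hiragana a */
}

static bool is_e_acute(uint32_t c) {
    return c == U'\u00E9';   /* U+00E9, e with acute */
}
```

In the question's program the same idea would mean passing U'\u3042' and U'\u00E9' to cmpchr instead of 'あ' and 'é'.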

Sam Varshavchik

From libunistring's documentation:

 Compares S1 and S2, each of length N, lexicographically.  Returns a
 negative value if S1 compares smaller than S2, a positive value if
 S1 compares larger than S2, or 0 if they compare equal.

The comparison in the if statement was wrong. That was the reason for the mismatch. Of course, this reveals other, unrelated, issues that also need to be fixed. But, that's the reason for the puzzling result of the comparison.
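The contract quoted above can be sketched as follows. This is a hypothetical stand-in that only mirrors the documented return values; it is not libunistring's implementation, and the names cmp_units and same_code_point are made up for illustration. Equality corresponds to a return value of 0, and for a single code point that is the same as comparing with ==.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT libunistring's code: follows the quoted
 * contract (negative, zero, or positive) for two arrays of n units. */
static int cmp_units(const uint32_t *s1, const uint32_t *s2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (s1[i] != s2[i]) {
            return s1[i] < s2[i] ? -1 : 1;
        }
    }
    return 0;
}

/* Two single code points are equal exactly when the result is 0. */
static int same_code_point(uint32_t expected, uint32_t actual) {
    return cmp_units(&expected, &actual, 1) == 0;
}
```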