How to handle char16_t or char32_t with printf and scanf in C?

If I write:

char a = 'A';
printf("%x %c", a, a);

it will produce the output "41 A". Similarly, when I write

char32_t c = U'🍌';
printf("%x %c", c, c);  //even tried %lc and %llc

it will produce the output "1f34c L" instead of the expected "1f34c 🍌"!

Is there something wrong here? How can I print char16_t and char32_t characters onto stdout?

Also, which format specifier should I use to get char16_t / char32_t input from scanf?

char32_t c;
scanf("%c", &c); //
printf("%x %c", c, c);

this will produce the output "f0 �".

There are 2 answers

Answer by KamilCuk (best answer)

char16_t and char32_t are nothing special. In C they are just typedefs for uint_least16_t and uint_least32_t, declared in <uchar.h>. Support for them is poor: they exist mainly to give u and U literals a type. They may not even be UTF-16 and UTF-32; check the __STDC_UTF_16__ and __STDC_UTF_32__ macros before assuming they are. The standard provides only very basic conversion functions, which convert char16_t or char32_t to the locale's multibyte encoding and back. To do anything more with them, you have to implement it yourself.
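As a minimal sketch of the macro check mentioned above, the two helpers below report whether the implementation promises UTF-16 and UTF-32 for these types. The function names are made up for illustration; only the two macros come from the standard.

```c
#include <uchar.h>

/* Hypothetical helper: 1 if char16_t values are guaranteed to be UTF-16. */
int char16_is_utf16(void) {
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if char32_t values are guaranteed to be UTF-32. */
int char32_is_utf32(void) {
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}
```

If char32_is_utf32() returns 0, the numeric value of a U'...' literal is implementation-defined and should not be treated as a Unicode code point.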

The C language really has two encodings: the locale-dependent multibyte character representation and the wide character representation.

Is there something wrong here?

A '🍌' character typed in the source file without the U prefix is interpreted by the compiler as an implementation-specific value. GCC reads it as UTF-8 and then shifts the bytes left into a single integer, so '🍌' is equal to (int)0xF09F8D8C on GCC; the behavior of multi-character literals such as 'something' is implementation-defined. That value is then assigned to the char32_t, and it is not a UTF-32 value at all.
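For the U'...' form shown in the question, the value really is 0x1f34c, and the stray "L" has a simple explanation: %c converts its argument to unsigned char before writing it, so only the low byte survives. The sketch below mimics that conversion; the helper name is made up, and it assumes 8-bit bytes and an ASCII-compatible execution character set.

```c
#include <uchar.h>

/* Hypothetical helper: the byte that printf's %c would end up writing
   for this value (the argument reduced to unsigned char). */
unsigned char as_percent_c(char32_t c) {
    return (unsigned char)c;
}
```

as_percent_c(0x1F34C) yields 0x4C, which is 'L' in ASCII. Strictly speaking, passing a char32_t where %c expects an int is a type mismatch, so even this result is not guaranteed by the standard.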

How can I print char16_t and char32_t characters onto stdout?

Convert them to a multibyte string, then print it with %s.

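#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void) {
    // Select a UTF-8 locale before anything is written to stdout.
    setlocale(LC_ALL, "en_US.UTF-8");
    char32_t c = U'🍌';
    // Zero-filled, so the converted bytes stay NUL-terminated.
    char buf[MB_LEN_MAX + 1] = {0};
    mbstate_t ps;
    memset(&ps, 0, sizeof(ps));
    // Convert one char32_t to the locale's multibyte encoding.
    if (c32rtomb(buf, c, &ps) == (size_t)-1) {
        perror("c32rtomb");
        return 1;
    }
    printf("%s\n", buf);
}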

Printing is locale-dependent: output is produced in the locale specified by the user. The default locale is C, which has no UTF support, so first set your locale to something UTF-compatible and then call c32rtomb. Note that in glibc a stream chooses its encoding the first time something is printed to it, so make sure to call setlocale before doing anything with the stream you want to work with.
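The same conversion extends to whole strings by calling c32rtomb once per character. The following is only a sketch: c32s_to_mbs is a made-up helper name, it relies on whatever locale is active when it is called, and it does not emit a final shift sequence for stateful encodings (UTF-8 does not need one).

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: converts a NUL-terminated char32_t string to the
   current locale's multibyte encoding. Returns 0 on success, -1 if a
   character is not representable or dst is too small. */
int c32s_to_mbs(char *dst, size_t dstsize, const char32_t *src) {
    mbstate_t ps;
    memset(&ps, 0, sizeof ps);
    size_t used = 0;
    for (; *src != 0; ++src) {
        char tmp[MB_LEN_MAX];
        size_t n = c32rtomb(tmp, *src, &ps);
        if (n == (size_t)-1)
            return -1;                 /* not representable in this locale */
        if (used + n + 1 > dstsize)
            return -1;                 /* keep room for the terminator */
        memcpy(dst + used, tmp, n);
        used += n;
    }
    if (used >= dstsize)
        return -1;                     /* no room for the terminator */
    dst[used] = '\0';
    return 0;
}
```

After a successful setlocale call, c32s_to_mbs(buf, sizeof buf, U"...") followed by printf("%s\n", buf) prints the string in the locale's encoding.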

which format specifier should I use to get char16_t / char32_t input from scanf?

None; there is no such specifier. Use wchar_t or plain char strings to read characters from the user in the encoding specified by their locale, then convert to or from char16_t and char32_t if you want. If you specifically want to read UTF-32 characters, you have to write that yourself to be sure your code reads UTF-32. I recommend libunistring.
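One way to do the "read as char, then convert" step is mbrtoc32, the standard counterpart of c32rtomb. This is a rough sketch: mb_first_to_c32 is a made-up helper, and it decodes according to the locale that is active at the time of the call.

```c
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decodes the first multibyte character of s (in the
   current locale's encoding) into *out. Returns the number of bytes
   consumed, or 0 if s is empty, truncated, or invalid. */
size_t mb_first_to_c32(const char *s, char32_t *out) {
    mbstate_t ps;
    memset(&ps, 0, sizeof ps);
    size_t n = mbrtoc32(out, s, strlen(s), &ps);
    /* (size_t)-1: invalid sequence, (size_t)-2: incomplete sequence,
       (size_t)-3: output produced without consuming input */
    if (n == (size_t)-1 || n == (size_t)-2 || n == (size_t)-3)
        return 0;
    return n;
}
```

A typical use would be to call setlocale(LC_ALL, ""), read a line with fgets into a char buffer, and pass that buffer to the helper; the returned byte count tells you where the next character starts.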

Answer by MONUDDIN TAMBOLI

I gave the value in hex: symbol = 0x0001F34C. There are other ways to solve this, but this is the one I know; see the code below. In C we cannot print this symbol with %c or a plain printf. This is why wchar_t is used instead of char: char holds UTF-8 code units, while wchar_t (on platforms such as Linux, where it is 32 bits wide) holds UTF-32, which increases its range.

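#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void) {
    // Use the user's locale so wide output can be encoded.
    setlocale(LC_CTYPE, "");
    wchar_t symbol = 0x0001F34C;
    // %x expects unsigned int and %lc expects wint_t, so cast explicitly.
    wprintf(L"%x %lc\n", (unsigned)symbol, (wint_t)symbol);
}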
output: 1f34c 🍌
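The wchar_t approach only works where wchar_t is wide enough for the code point, which is not the case everywhere (it is 16 bits on Windows, for example). A small check along these lines can guard against that; wchar_can_hold is a made-up name, and WCHAR_MAX comes from <wchar.h>.

```c
#include <wchar.h>

/* Hypothetical helper: 1 if a single wchar_t can represent code_point. */
int wchar_can_hold(unsigned long code_point) {
    return code_point <= (unsigned long)WCHAR_MAX;
}
```

Calling wchar_can_hold(0x1F34C) before the wprintf tells you whether the symbol fits in one wchar_t on the current platform.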

See also: Printing a Unicode Symbol in C; Unicode of the banana emoji; char32_t.