C read and write unsigned char (0 - 255) as UTF-8


I am trying to read and write unsigned char values (0 - 255) as extended ASCII characters (Unicode) from and to the console in C under Windows (cross-platform compatibility is needed).

Under extended ASCII (Unicode), code point 255 is ÿ and code point 220 is Ü.

Right now I have the following code for writing and reading.

#include<stdio.h>
#include<locale.h>

int main() {
    setlocale(LC_ALL, "");

    unsigned char ch = 255;
    wprintf(L"Character %d = %lc\n", ch, ch);

    wprintf(L"Enter a character: ");
    wscanf(L"%lc", &ch);
    wprintf(L"Character %d = %lc\n", ch, ch);

    return 0;
}

The output is:

Character 255 = ÿ
Enter a character: ÿ
Character 220 = Ü

As evident, code-point 255 is displayed properly as ÿ. However, when taking ÿ as input, it is being read as code-point 220. Consequently, when code-point 220 is printed, it is displayed as Ü.

Thus, writing works fine. When reading, however, characters above 127 (128 - 255) come back wrong: in the example above, the code point read is 35 less than the actual value (220 instead of 255).

Can you please help me understand what I am doing wrong and how I can fix this?

Answer by Schwern

%lc takes a wide character, wchar_t. "Wide" refers to it being multi-byte, but the exact size is implementation-specific. Giving wscanf a pointer to a 1-byte unsigned char will cause odd behavior, as it will write a byte or more past the end of the variable.
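As a rough sketch of the type mismatch described above, the read can go into a real wchar_t first and be narrowed afterwards. The helper name read_code_point_0_255 is made up for illustration and is not part of the question's code. Whether the value read matches the Unicode code point still depends on the locale and the console's encoding, which this sketch does not address.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not from the question): read one wide character
   with %lc into a real wchar_t, then narrow it only if it fits in 0..255.
   Returns 1 on success, 0 on EOF, read error, or an out-of-range value. */
int read_code_point_0_255(FILE *in, unsigned char *out)
{
    wchar_t wc;

    /* %lc stores a whole wchar_t, so the destination must be one. */
    if (fwscanf(in, L"%lc", &wc) != 1)
        return 0;

    /* wchar_t may be signed; the cast makes negative values fail too. */
    if ((unsigned long)wc > 255UL)
        return 0;

    *out = (unsigned char)wc;
    return 1;
}
```

In the question's program this would be called as read_code_point_0_255(stdin, &ch) after setlocale(LC_ALL, ""), in place of the wscanf call that writes directly into ch.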

But if you're using 1-byte characters, you don't need wprintf or wscanf. Just use printf and scanf.
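A minimal sketch of the narrow-character approach, using the printf/scanf family only. The helper names read_byte and format_byte are invented for this example. Note the assumption: this handles raw bytes, so how a byte above 127 is displayed depends on the console's code page or terminal encoding.

```c
#include <stdio.h>

/* Hypothetical helper: read one byte with a plain (narrow) %c conversion.
   Returns 1 on success, 0 on EOF or read error. */
int read_byte(FILE *in, unsigned char *out)
{
    char tmp;

    if (fscanf(in, "%c", &tmp) != 1)
        return 0;

    *out = (unsigned char)tmp;   /* map a possibly signed char to 0..255 */
    return 1;
}

/* Hypothetical helper: format the numeric value and the raw byte,
   mirroring the question's output line. Returns what snprintf returns. */
int format_byte(char *buf, size_t size, unsigned char ch)
{
    return snprintf(buf, size, "Character %d = %c\n", ch, ch);
}
```

With stdin as the stream, read_byte(stdin, &ch) takes the place of the wscanf call, and the formatted line can be written with fputs(buf, stdout).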

And, as noted by others, "extended ASCII" is not "Unicode". See this question for more.
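To make the distinction concrete, here is a small sketch that assumes the 0 - 255 values are ISO-8859-1 (Latin-1) codes. That is an assumption, since "extended ASCII" covers many different code pages. Under it, the values coincide with Unicode code points U+0000 to U+00FF, but in UTF-8 anything above 127 takes two bytes, so a single unsigned char cannot hold the encoded character. The function name latin1_to_utf8 is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: encode a code point in 0..255 (assumed Latin-1)
   as UTF-8. Writes 1 or 2 bytes to out and returns how many were written. */
size_t latin1_to_utf8(unsigned char cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = cp;                                /* plain ASCII: 1 byte */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));     /* lead byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));   /* continuation: 10xxxxxx */
    return 2;
}
```

For example, code point 255 (ÿ) becomes the two bytes 0xC3 0xBF, and code point 220 (Ü) becomes 0xC3 0x9C.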