I know that this question might be very elementary. Please excuse me if this is something that is obvious. Consider the following program:
#include <stdio.h>

int main(void) {
    // this is a string in English
    char * str_1 = "This is a string.";
    // this is a string in Russian
    char * str_2 = "Это строковая константа.";
    // iterator
    int i;

    // print English string as a string
    printf("%s\n", str_1);
    // print English string byte by byte
    for(i = 0; str_1[i] != '\0'; i++) {
        printf(" %c ",(char) str_1[i]);
    }
    printf("\n");
    // print numerical values of English string byte by byte
    for(i = 0; str_1[i] != '\0'; i++) {
        printf("%03d ",(int) str_1[i]);
    }
    printf("\n");

    // print Russian string as a string
    printf("%s\n", str_2);
    // print Russian string byte by byte
    for(i = 0; str_2[i] != '\0'; i++) {
        printf(" %c ",(char) str_2[i]);
    }
    printf("\n");
    // print numerical values of Russian string byte by byte
    for(i = 0; str_2[i] != '\0'; i++) {
        printf("%03d ",(int) str_2[i]);
    }
    printf("\n");

    return(0);
}
Output:
This is a string.
T h i s i s a s t r i n g .
084 104 105 115 032 105 115 032 097 032 115 116 114 105 110 103 046
Это строковая константа.
▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ ▒ .
-48 -83 -47 -126 -48 -66 032 -47 -127 -47 -126 -47 -128 -48 -66 -48 -70 -48 -66 -48 -78 -48 -80 -47 -113 032 -48 -70 -48 -66 -48 -67 -47 -127 -47 -126 -48 -80 -48 -67 -47 -126 -48 -80 046
It can be seen that an English (ASCII) string can be printed as a string or accessed using array indexes and printed character by character (byte by byte), but a Russian string (I believe encoded as UTF-8) can be printed as a string but not accessed character by character.
I understand that the reason is that, in this case, each Russian character is encoded using two bytes instead of one.
What I am wondering is whether there is any easy way to print a Unicode string character by character (in this case two bytes by two bytes) using standard C library functions by proper declaration of a data type or by labeling the string somehow or by setting a locale or in some other way.
I tried preceding the Russian string by "u8", that is char * str_2 = u8"...", but this doesn't change the behavior. I'd like to stay away from using wide characters that make assumptions about what language is being used, for example exactly two bytes per character. Any advice would be appreciated.
Here's a simple solution using the sscanf function. C99 requires that both printf and scanf (and friends) understand the l size qualifier to the %s and %c conversion specifiers, causing them to convert between the multibyte (e.g. UTF-8) representation and wide strings/characters (i.e. wchar_t, which is an integer type large enough to contain a codepoint). That means you can use it to take a string apart one (multibyte) character at a time, without worrying about whether the sequence is just seven-bit characters (English) or not. (If that sounds complicated, look at the code. Essentially, it just adds an l qualifier to the format strings.)

This does use wchar_t, which may be restricted to 16 bits on some platforms (Windows, cough, cough). I suspect if you use astral plane characters on Windows, you'll end up with surrogate characters, which are likely to cause you grief, but the code works fine on both Linux and Mac, at least in not-too-ancient versions.

Note the call to setlocale at the beginning of the program. That's necessary for any wide-character function to work; it sets the execution locale to the default system locale, which will normally be a locale in which multibyte characters are UTF-8. (However, the code below doesn't really care. It just requires that the input to the function be in the multibyte representation specified by the current locale.)

It might not be the fastest solution to this problem, but it has the advantage of being quite a bit simpler to write, at least in my opinion.
The following is based on the original code, but I refactored the output into a single function for simplicity. I also changed the numerical output to hexadecimal (because it's easier to verify with code charts).
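A minimal sketch of that approach, assuming the caller has already run setlocale(LC_ALL, "") and that the strings are in the locale's multibyte encoding (typically UTF-8). The function name print_string and its return value (the number of characters decoded) are illustrative choices:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Print s three ways: whole, one character at a time, and as hex code
 * points. Returns the number of characters decoded (an illustrative
 * addition). Assumes setlocale(LC_ALL, "") has already been called and
 * that s is in the current locale's multibyte encoding. */
int print_string(FILE *out, const char *s) {
    int nbytes = (int)strlen(s);
    int count = 0;
    int i, n;
    wchar_t ch;

    /* as a string */
    fprintf(out, "%s\n", s);

    /* character by character: %lc converts one multibyte character to
     * wchar_t, and %n stores how many bytes that character occupied */
    for (i = 0; i < nbytes; i += n) {
        if (sscanf(s + i, "%lc%n", &ch, &n) != 1) break;
        fprintf(out, " %lc ", (wint_t)ch);
        count++;
    }
    fprintf(out, "\n");

    /* code points in hexadecimal */
    for (i = 0; i < nbytes; i += n) {
        if (sscanf(s + i, "%lc%n", &ch, &n) != 1) break;
        fprintf(out, "%04lX ", (unsigned long)ch);
    }
    fprintf(out, "\n");

    return count;
}
```

A caller would run setlocale(LC_ALL, "") once at the start of main and then call print_string(stdout, str_1) and print_string(stdout, str_2).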
The above attempts to mimic the code in the OP. I would actually prefer to write the loop using a pointer instead of an index and checking the return code of sscanf as a termination condition:

Even better would be to make sure sscanf is not returning an error, indicating that there was an invalid multibyte sequence.

Here's the output on my system: