Accessing foreign-language string character by character


I know that this question might be very elementary. Please excuse me if this is something that is obvious. Consider the following program:

#include <stdio.h>

int main(void) {
   // this is a string in English
   char * str_1 = "This is a string.";
   // this is a string in Russian
   char * str_2 = "Это строковая константа.";
   // iterator
   int i;
   // print English string as a string
   printf("%s\n", str_1);
   // print English string byte by byte
   for(i = 0; str_1[i] != '\0'; i++) {
      printf(" %c  ",(char) str_1[i]);
   }
   printf("\n");
   // print numerical values of English string byte by byte
   for(i = 0; str_1[i] != '\0'; i++) {
      printf("%03d ",(int) str_1[i]);
   }
   printf("\n");
   // print Russian string as a string
   printf("%s\n", str_2);
   // print Russian string byte by byte
   for(i = 0; str_2[i] != '\0'; i++) {
      printf(" %c  ",(char) str_2[i]);
   }
   printf("\n");
   // print numerical values of Russian string byte by byte
   for(i = 0; str_2[i] != '\0'; i++) {
      printf("%03d ",(int) str_2[i]);
   }
   printf("\n");
   return(0);
}

Output:

This is a string.
 T   h   i   s       i   s       a       s   t   r   i   n   g   .
084 104 105 115 032 105 115 032 097 032 115 116 114 105 110 103 046
Это строковая константа.
 ▒   ▒   ▒   ▒   ▒   ▒       ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒       ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   .
-48 -83 -47 -126 -48 -66 032 -47 -127 -47 -126 -47 -128 -48 -66 -48 -70 -48 -66 -48 -78 -48 -80 -47 -113 032 -48 -70 -48 -66 -48 -67 -47 -127 -47 -126 -48 -80 -48 -67 -47 -126 -48 -80 046

It can be seen that an English (ASCII) string can be printed as a string or accessed using array indexes and printed character by character (byte by byte), but a Russian string (I believe encoded as UTF-8) can be printed as a string but not accessed character by character.

I understand the reason: in this case each Russian character is encoded using two bytes instead of one.

What I am wondering is whether there is any easy way to print a Unicode string character by character (in this case two bytes by two bytes) using standard C library functions by proper declaration of a data type or by labeling the string somehow or by setting a locale or in some other way.

I tried preceding the Russian string by "u8", that is char * str_2 = u8"...", but this doesn't change the behavior. I'd like to stay away from using wide characters that make assumptions about what language is being used, for example exactly two bytes per character. Any advice would be appreciated.

There are 3 answers

rici On

Here's a simple solution using the sscanf function. C99 requires that both printf and scanf (and friends) understand the l length modifier on the %s and %c conversion specifiers, which makes them convert between the multibyte representation (e.g. UTF-8) and wide strings/characters (i.e. wchar_t, which on most platforms is an integer type large enough to contain a code point). That means you can use it to take a string apart one (multibyte) character at a time, without worrying about whether the sequence is just seven-bit characters (English) or not. (If that sounds complicated, look at the code. Essentially, it just adds an l modifier to the format strings.)

This does use wchar_t, which may be restricted to 16 bits on some platforms (Windows, cough, cough). I suspect if you use astral plane characters on Windows, you'll end up with surrogate characters, which are likely to cause you grief, but the code works fine on both Linux and Mac, at least in not-too-ancient versions.
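To make the 16-bit wchar_t concern concrete, here is a minimal sketch of how a code point outside the BMP splits into a UTF-16 surrogate pair. The name to_surrogates is made up for illustration; it is not a standard library function.

```c
#include <stdint.h>

/* Hypothetical helper (not a library function): split a code point into
 * UTF-16 code units. Returns the number of units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
int to_surrogates(uint32_t cp, uint16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or itself a surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: fits in one 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+1F638 (the cat emoji in str_3 below) this gives the two units D83D and DE38, which is why a single 16-bit wchar_t cannot hold that character.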

Note the call to setlocale at the beginning of the program. That's necessary for any wide-character function to work; it sets the execution locale to the default system locale, which will normally be a locale in which multibyte characters are UTF-8. (However, the code below doesn't really care. It just requires that the input to the function be in the multibyte representation specified by the current locale.)
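Since everything here depends on the locale, it may be worth checking what setlocale actually gave you. The following is only a sketch: init_locale and locale_is_multibyte are hypothetical helper names, and MB_CUR_MAX > 1 only tells you the locale is multibyte, not that it is specifically UTF-8.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: adopt the locale from the environment, falling back
 * to "C" if that fails. Returns the locale name reported by setlocale. */
const char *init_locale(void) {
    const char *name = setlocale(LC_ALL, "");   /* the environment's locale */
    if (name == NULL)
        name = setlocale(LC_ALL, "C");          /* "C" is always available */
    return name;
}

/* Nonzero if the current locale uses a multibyte encoding, i.e. some
 * character may need more than one byte. */
int locale_is_multibyte(void) {
    return MB_CUR_MAX > 1;
}
```

If locale_is_multibyte() returns 0 after init_locale(), the multibyte conversions will treat the input as single-byte characters and the approach in this answer won't split UTF-8 text correctly.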

It might not be the fastest solution to this problem, but it has the advantage of being quite a bit simpler to write, at least in my opinion.

The following is based on the original code, but I refactored the output into a single function for simplicity. I also changed the numerical output to hexadecimal (because it's easier to verify with code charts).

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Print the string three ways */
void print3(const char* s);

void print3(const char* s) {
   wchar_t wch;
   int n;
   // print as a string
   printf("%s\n", s);
   // print char by char
   for (int i = 0; s[i] != '\0'; i += n) {
      // stop on an invalid multibyte sequence; n would be unset otherwise
      if (sscanf(s+i, "%lc%n", &wch, &n) != 1) break;
      printf("   %lc   ", wch);
   }
   putchar('\n');
   // print numerical values char by char 
   for (int i = 0; s[i] != '\0'; i += n) {
      // same check as above
      if (sscanf(s+i, "%lc%n", &wch, &n) != 1) break;
      printf(" %05lx ", (unsigned long)wch);
   }
   putchar('\n');
}

int main(void) {
   setlocale(LC_ALL, "");

   char *str_1 = "This is a string.";
   char *str_2 = "Это строковая константа.";
   char *str_3 = u8"\U0001d7d8\U0001d7d9\U0001f638 in the astral plane";

   print3(str_1);
   print3(str_2);
   print3(str_3);
   return 0;
}

The above attempts to mimic the code in the OP. I would actually prefer to write the loop using a pointer instead of an index and checking the return code of sscanf as a termination condition:

/* Print the string three ways */
void print3(const char* s) {
   wchar_t wch;
   int n;
   // print as a string
   printf("%s\n", s);
   // print char by char
   for (const char* p = s;
        sscanf(p, "%lc%n", &wch, &n) > 0;
        p += n) {
           printf("   %lc   ", wch);
   }
   putchar('\n');
   for (const char* p = s;
        sscanf(p, "%lc%n", &wch, &n) > 0;
        p += n) {
           printf(" %5.4lx ", (unsigned long)wch);
   }
   putchar('\n');
}

Even better would be to make sure sscanf is not returning an error, indicating that there was an invalid multibyte sequence.

Here's the output on my system:

This is a string.
   T      h      i      s             i      s             a             s      t      r      i      n      g      .   
  0054   0068   0069   0073   0020   0069   0073   0020   0061   0020   0073   0074   0072   0069   006e   0067   002e 
Это строковая константа.
   Э      т      о             с      т      р      о      к      о      в      а      я             к      о      н      с      т      а      н      т      а      .   
  042d   0442   043e   0020   0441   0442   0440   043e   043a   043e   0432   0430   044f   0020   043a   043e   043d   0441   0442   0430   043d   0442   0430   002e 
𝟘𝟙😸 in the astral plane
   𝟘      𝟙      😸             i      n             t      h      e             a      s      t      r      a      l             p      l      a      n      e   
 1d7d8  1d7d9  1f638   0020   0069   006e   0020   0074   0068   0065   0020   0061   0073   0074   0072   0061   006c   0020   0070   006c   0061   006e   0065
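Following up on the remark about detecting invalid multibyte sequences: one way to get an explicit error is to use mbrtowc instead of sscanf. This is a sketch rather than a drop-in replacement; count_mb_chars is a hypothetical name, and like the code above it interprets the bytes according to the current locale, so setlocale must have been called first for UTF-8 input.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in s according to
 * the current locale. Returns -1 if s contains an invalid or truncated
 * multibyte sequence. */
long count_mb_chars(const char *s) {
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial conversion state */
    size_t left = strlen(s);
    long count = 0;
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* invalid (-1) or incomplete (-2) sequence */
        if (n == 0)
            break;                      /* NUL reached; not expected since left = strlen(s) */
        s += n;                         /* advance by the bytes this character used */
        left -= n;
        count++;
    }
    return count;
}
```

The same loop shape works for printing: wc holds the decoded character at each step, so it can be passed to printf with %lc exactly as in print3.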
Jonathan Leffler On

I think the mblen(), mbtowc(), wctomb(), mbstowcs() and wcstombs() functions from <stdlib.h> are partially relevant. You can find out how many bytes make up each character in the string with mblen(), for example.

Another seldom-used header and function that are relevant here are <locale.h> and setlocale().

Here's an adaptation of your code:

#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static inline void ntbs_hex_dump(const char *pc_ntbs)
{
    unsigned char *ntbs = (unsigned char *)pc_ntbs;
    for (int i = 0; ntbs[i] != '\0'; i++)
        printf(" %.2X ", ntbs[i]);
    putchar('\n');
}

static inline void ntbs_chr_dump(const char *pc_ntbs)
{
    unsigned char *ntbs = (unsigned char *)pc_ntbs;
    for (int i = 0; ntbs[i] != '\0'; i++)
        printf(" %c  ", ntbs[i]);
    putchar('\n');
}

int main(void)
{
    char *loc = setlocale(LC_ALL, "");
    printf("Locale: %s\n", loc);

    char *str_1 = "This is a string.";
    char *str_2 = "Это строковая константа.";

    printf("English:\n");
    printf("%s\n", str_1);
    ntbs_chr_dump(str_1);
    ntbs_hex_dump(str_1);

    printf("Russian:\n");
    printf("%s\n", str_2);
    ntbs_chr_dump(str_2);
    ntbs_hex_dump(str_2);

    char *mbp = str_2;
    while (*mbp != '\0')
    {
        enum { MBS_LEN = 10 };
        int mbl = mblen(mbp, strlen(mbp));
        char mbs[MBS_LEN];
        assert(mbl < MBS_LEN - 1 && mbl > 0);
        // printf("mbl = %d\n", mbl);
        memmove(mbs, mbp, mbl);
        mbs[mbl] = '\0';
        printf(" %s ", mbs);
        mbp += mbl;
    }
    putchar('\n');

    return(0);
}

The setlocale() is important, at least on macOS Sierra 10.12.2 (with GCC 6.3.0), which is where I developed and tested it. Without that, mblen() always returns 1, and there is no benefit in the code.

The output I get from that is:

Locale: en_US.UTF-8
English:
This is a string.
 T   h   i   s       i   s       a       s   t   r   i   n   g   .  
 54  68  69  73  20  69  73  20  61  20  73  74  72  69  6E  67  2E 
Russian:
Это строковая константа.
 ?   ?   ?   ?   ?   ?       ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?       ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   .  
 D0  AD  D1  82  D0  BE  20  D1  81  D1  82  D1  80  D0  BE  D0  BA  D0  BE  D0  B2  D0  B0  D1  8F  20  D0  BA  D0  BE  D0  BD  D1  81  D1  82  D0  B0  D0  BD  D1  82  D0  B0  2E 
 Э  т  о     с  т  р  о  к  о  в  а  я     к  о  н  с  т  а  н  т  а  . 

With a bit more effort, the code could print the pairs of bytes for the UTF-8 data more closely together. The D0 and D1 leading bytes are correct for the UTF-8 encoding of the Cyrillic code block U+0400 .. U+04FF in the BMP (basic multilingual plane).
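As a rough sketch of that "bit more effort", the hex dump can keep the bytes of each character together. Note the swapped-in technique: instead of mblen(), this version reads the sequence length from the UTF-8 lead byte, so it assumes the input is UTF-8 and does not depend on the locale. Both function names are hypothetical, and continuation bytes are not validated.

```c
#include <stddef.h>
#include <stdio.h>

/* Length of a UTF-8 sequence judged from its lead byte; 1 for anything
 * unrecognised, so the dump always makes progress. */
static size_t utf8_len(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: write the bytes of s as hex into out, with the bytes
 * of one character adjacent and a space between characters.
 * Returns 0 on success, -1 if out is too small. */
int hex_dump_grouped(const char *s, char *out, size_t outsz) {
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0;
    if (outsz == 0) return -1;
    out[0] = '\0';
    while (*p != '\0') {
        size_t len = utf8_len(*p);
        size_t i;
        if (used > 0) {                         /* separator between characters */
            if (used + 2 > outsz) return -1;
            out[used++] = ' ';
            out[used] = '\0';
        }
        for (i = 0; i < len && p[i] != '\0'; i++) {
            if (used + 3 > outsz) return -1;    /* two hex digits plus NUL */
            snprintf(out + used, outsz - used, "%02X", (unsigned)p[i]);
            used += 2;
        }
        p += i;                                 /* skip the bytes just dumped */
    }
    return 0;
}
```

For the first two letters of the Russian string this produces D0AD D182, which makes the two-byte pairs easier to check against a code chart.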

Just for your amusement: BSD sed refused to process the output, because the isolated bytes shown as question marks are not valid characters on their own: sed: RE error: illegal byte sequence.

DYZ On

You were correctly advised to write your own UTF-8 parser, which is actually quite easy to do. Here's a sample implementation (it handles only one- and two-byte sequences, which is enough for Cyrillic):

int utf8decode(unsigned char *utf8, unsigned *code) {
  while(*utf8) { /* Scan the whole string */
    if ((utf8[0] & 128) == 0) { /* Handle single-byte characters */
      *code = utf8[0];
      utf8++;
    } else { /* Looks like it's a 2-byte character; is it? */
      if ((utf8[0] >> 5) != 6 || (utf8[1] >> 6) != 2)
        return 1;
      /* Yes, it is; do bit magic */
      *code = ((utf8[0] & 31) << 6) + (utf8[1] & 63);
      utf8 += 2;
    }
    code++;
  }
  *code = 0;
  return 0; /* We got it! */
}

Let's do some testing:

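#include <stdio.h> /* for printf and puts */

int main(void) {
  int i = 0;
  /* A string literal is an array of char; cast to match utf8decode's parameter */
  unsigned char *str = (unsigned char *)"Это строковая константа.";
  unsigned codes[1024]; /* Must hold one entry per character plus the final 0 */ 

  if (utf8decode(str, codes) == 1) /* Decode; 1 means an unsupported sequence */
    return 1;
  while(codes[i]) /* Print the code points in decimal */
    printf("%u ", codes[i++]); 

  puts(""); /* Final newline */
  return 0;
}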

1069 1090 1086 32 1089 1090 1088 1086 1082 1086 1074 1072 1103 32 1082 1086 1085 1089 1090 1072 1085 1090 1072 46
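The decoder above stops at two-byte sequences. If you also need three- and four-byte sequences (the rest of the BMP and the astral planes), a sketch along the following lines might do. utf8_decode_one is a hypothetical name, and this is one possible way to structure it, not the only one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed, truncated, overlong, a surrogate,
 * or above U+10FFFF. A NUL byte decodes as code point 0 with length 1,
 * so the caller should stop when it sees the terminator. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *cp) {
    /* Smallest code point that legitimately needs each length */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len;
    uint32_t c;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also catches the terminating NUL */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                   /* overlong, out of range, or surrogate */
    *cp = c;
    return len;
}
```

To walk a whole string, call it in a loop while *p != '\0', advance p by the return value, and treat a return of 0 as an error.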