Required to convert a String to UTF8 string

1.1k views Asked by At

Problem Statement: I am required to convert a generated string to UTF8 string, this generated string has extended ascii characters and I am on Linux system (2.6.32-358.el6.x86_64).

A POC is still in progress so I can only provide small code samples and complete solution can be posted only once ready.

Why I required UFT8 (I have extended ascii characters to be stored in a string which has to be UTF8).

How I am proceeding:

  • Convert generated string to wchar_t string.

Please look at the below sample code

int main(){
  char  CharString[] = "Prova";
  iconv_t cd;
  wchar_t  WcharString[255];

  size_t size= mbstowcs(WcharString, CharString, strlen(CharString));

  wprintf(L"%ls\n", WcharString);

  wprintf(L"%s\n", WcharString);

  printf("\n%zu\n",size);
}

One question here:

Output is

Prova?????

s

  1. Why the size is not printed here ?
  2. Why the second printf prints only one character.
  3. If I print size before both printed string then only 5 is printed and both strings are missing from console.


Moving on to Second Part:

Now that I will have a wchar_t string I want to convert it to UTF8 string

For this I was surfing through and found iconv will help here.

Question here These are the methods I found in manual

**iconv_t iconv_open(const char *, const char *);

size_t  iconv(iconv_t, char **, size_t *, char **, size_t *);

int     iconv_close(iconv_t);**

Do I need to convert back wchar_t array to char array to before feeding to iconv ?

Please provide suggestions on the above issues.

Extended ascii I am talking about please see letters i in the marked snapshot below enter image description here

2

There are 2 answers

1
rici On BEST ANSWER

For your first question (which I am interpreting as "why is all the output not what I expect"):

  1. Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.

  2. Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L"p", which is the number 0x70 in some integer size. That means that the second byte of the wchar_t (counting from the end, since you have a little-endian architecture) is a 0, so if you treat the string as a string of characters, it will be terminated immediately after the p. So only one character is printed.

  3. Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.


Now, let's move on to your second question: What do I do about it?

The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.

Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)

If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.

The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.

You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.

2
ikrabbe On

It's no good idea to use iconv for utf8. Just implement the definition of utf8 yourself. That is quite easily in done in C from the Description https://en.wikipedia.org/wiki/UTF-8. You don't even need wchar_t, just use uint32_t for your characters. You will learn much if you implement yourself and your program will gain speed from not using mb or iconv functions.