Problem Statement: I am required to convert a generated string to a UTF-8 string. The generated string contains extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress, so I can only provide small code samples; a complete solution can be posted only once it is ready.
Why I require UTF-8: I have extended ASCII characters to be stored in a string which has to be UTF-8.
How I am proceeding:
- Convert generated string to wchar_t string.
Please look at the sample code below:
    int main(){
        char CharString[] = "Prova";
        iconv_t cd;
        wchar_t WcharString[255];
        size_t size= mbstowcs(WcharString, CharString, strlen(CharString));
        wprintf(L"%ls\n", WcharString);
        wprintf(L"%s\n", WcharString);
        printf("\n%zu\n",size);
    }
One question here:
Output is
Prova?????
s
- Why is the size not printed here?
- Why does the second wprintf print only one character?
- If I print the size before both strings, then only 5 is printed and both strings are missing from the console.
Moving on to the second part:
Now that I have a wchar_t string, I want to convert it to a UTF-8 string.
While searching for a way to do this, I found that iconv should help here.
My question: these are the functions I found in the manual
    iconv_t iconv_open(const char *, const char *);
    size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
    int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
For the extended ASCII characters I am talking about, please see the letters i in the marked snapshot below.
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call `mbstowcs(WcharString, CharString, strlen(CharString))`, the last argument (`strlen(CharString)`) is the length of the output buffer, not the length of the input string. `mbstowcs` will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the `?????`. You should use the size of the output buffer in `wchar_t`s (255, in this case) instead.

Why does the second `wprintf` only print one character? When you call `wprintf` with a wide character string argument, you must use the `%ls` format code (or, more accurately, the `%s` conversion needs to be qualified with an `l` length modifier). If you use `%s` without the `l`, then `wprintf` will interpret the string as a `char*`, and it will convert each character to a `wchar_t` as it outputs it. However, since the argument is actually a wide character string, the first `wchar_t` in the string is `L'P'`, which is the number `0x50` in some integer size. You have a little-endian architecture, so the low-order byte `0x50` comes first in memory and the second byte of that `wchar_t` is a 0. If you treat the string as a string of characters, it is therefore terminated immediately after the `P`. So only one character is printed.

Why doesn't the last `printf` print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like `wprintf`, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the `printf` is invalid, and it does nothing other than report an error. The same rule explains your third observation: if `printf` runs first, the stream becomes a byte stream, so the size is printed and both `wprintf` calls fail.

Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use `wchar_t` at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.

Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ASCII characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using `mbstowcs` from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.

You can certainly use `iconv` to convert a string from one encoding to another; you don't need to go through `wchar_t` to do so. But you do need to know the actual input encoding and the desired output encoding.