Convert UTF-16LE to UTF-8 in C


I am using a library with a function that returns result strings encoded as UTF-16LE (I'm pretty sure) in a standard char *, along with the number of bytes in the string. I would like to convert these strings to UTF-8. I tried the solution from the question "Convert UTF-16 to UTF-8 under Windows and Linux, in C", which says to use iconv, but both the input and output buffers wound up empty. What am I missing?

My input and output buffers are declared and initialized as follows:

char *resbuff=NULL;
char *outbuff=NULL;
int stringLen;
size_t outbytes=1024;
size_t inbytes;
size_t convResult;
...
//some loop and control code here
...
if (resbuff==NULL) {
    resbuff=(char *)malloc(1024);
    outbuff=(char *)malloc(1024);
}

I then call the library function to fill resbuff with data. Looking at the buffer in the debugger, I can see the data. For example, if the data is "test", I see the following at the individual indexes of resbuff:

't','\0','e','\0','s','\0','t','\0'

I believe this is UTF-16LE (other code using the same library appears to confirm this), and stringLen now equals 8. I then try to convert that to UTF-8 using the following code:

iconv_t conv;
conv=iconv_open("UTF-8", "UTF-16LE");
inbytes=stringLen;
convResult=iconv(conv,&resbuff,&inbytes,&outbuff,&outbytes); //this does return 0
iconv_close(conv);

With the result that outbuff and resbuff both end up as null strings.

Note that I declare stringLen as an int rather than an unsigned long because that is what the library function is expecting.

EDIT: I tweaked my code slightly as per John Bollinger's answer below, but it didn't change the outcome.

EDIT 2: Ultimately the output from this code will be used in Python, so I'm thinking that while it might be uglier, I'll just perform the string conversion there. It just works.


There is 1 answer

Answer by John Bollinger:

You do not show the declaration or initialization of variables stringLen and outbytes, and your problem might well lie there. However, this ...

Note that I declare stringLen as an int rather than an unsigned long because that is what the library function is expecting.

... is very troubling. The iconv() function expects its third and fifth arguments to be of type size_t *, and lying to the compiler via a cast isn't going to make the code actually work if they are in fact different types. You should have something along these lines:

/* The two sizes below use the values from the question; adjust as needed. */
size_t in_bytes_left = (size_t) stringLen;   /* total input length, in bytes */
size_t out_bytes_available = 1024;           /* size of the output buffer */
char *input_temp = resbuff;                  /* copies: iconv() advances these */
char *output_temp = outbuff;
size_t result;                               /* iconv() returns size_t, not int */

result = iconv(conv, &input_temp, &in_bytes_left, &output_temp, &out_bytes_available);

Note, too, that you should check the return value to make sure the conversion was complete and successful. Because iconv() returns a size_t, it cannot return a negative number; it signals failure by returning (size_t) -1. If you get that value, then the value of errno immediately after the call will tell you what kind of problem occurred.
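To make the error checking concrete, here is a minimal sketch of the whole conversion wrapped in one function. The name utf16le_to_utf8 and its parameters are my own invention for illustration, not part of iconv or of the asker's library, and it assumes a platform whose iconv supports the "UTF-16LE" and "UTF-8" encoding names (glibc does).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original post): converts in_len bytes of
 * UTF-16LE at `in` to UTF-8 in `out` (capacity out_cap bytes) and
 * NUL-terminates the result. Returns 0 on success, -1 on failure. */
static int utf16le_to_utf8(const char *in, size_t in_len,
                           char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL && out_cap > 0);

    iconv_t conv = iconv_open("UTF-8", "UTF-16LE");
    if (conv == (iconv_t) -1) {
        perror("iconv_open");
        return -1;
    }

    /* Work on copies: iconv() advances these, the caller's pointers stay put.
     * The cast is needed because iconv() takes char **; it does not write
     * to the input buffer. */
    char *in_pos = (char *) in;
    char *out_pos = out;
    size_t in_left = in_len;
    size_t out_left = out_cap - 1;      /* reserve room for the terminator */

    size_t rc = iconv(conv, &in_pos, &in_left, &out_pos, &out_left);
    int saved_errno = errno;            /* iconv_close() may change errno */
    iconv_close(conv);

    if (rc == (size_t) -1) {            /* failure is (size_t) -1, not "< 0" */
        errno = saved_errno;
        perror("iconv");                /* e.g. E2BIG, EILSEQ or EINVAL */
        return -1;
    }

    *out_pos = '\0';                    /* out_pos now points past the output */
    return 0;
}
```

Called as utf16le_to_utf8(resbuff, (size_t) stringLen, outbuff, 1024), this leaves resbuff and outbuff pointing at the start of their buffers, which is what the code in the question loses.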

Edited to add:

You originally said that zero bytes were converted, but you now say that

outbuff and resbuff both end up as null strings.

which is not the same thing at all.

The iconv() function updates the pointers to the input and output buffers to facilitate converting a long input via multiple calls, the need for that being fairly common. That's why you must pass pointers to those pointers. If you don't want to lose the original values of these pointers then you should make and pass copies; I have updated my code above to demonstrate this.
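As a rough illustration of why the pointers are updated, the sketch below converts an input in several iconv() calls through a deliberately tiny scratch buffer. The function name convert_in_chunks, the 4-byte chunk size, and the choice to append each piece to a caller-supplied buffer are all assumptions made for the example; a real program would more likely write each piece to a file or socket.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): converts UTF-16LE to
 * UTF-8 through a small scratch buffer, appending each converted piece to
 * dst. Returns the number of UTF-8 bytes written, or (size_t) -1 on error. */
static size_t convert_in_chunks(const char *in, size_t in_len,
                                char *dst, size_t dst_cap)
{
    assert(in != NULL && dst != NULL);

    iconv_t conv = iconv_open("UTF-8", "UTF-16LE");
    if (conv == (iconv_t) -1)
        return (size_t) -1;

    char chunk[4];                  /* tiny on purpose, to force several calls */
    char *in_pos = (char *) in;     /* advanced by iconv(); input is not written */
    size_t in_left = in_len;
    size_t total = 0;

    while (in_left > 0) {
        char *out_pos = chunk;      /* fresh output window for every call */
        size_t out_left = sizeof chunk;

        size_t rc = iconv(conv, &in_pos, &in_left, &out_pos, &out_left);
        int err = errno;
        size_t produced = (size_t) (out_pos - chunk);

        /* E2BIG only means "the chunk is full, call again" as long as some
         * output was produced; any other error is treated as fatal. */
        if (rc == (size_t) -1 && (err != E2BIG || produced == 0)) {
            iconv_close(conv);
            return (size_t) -1;
        }
        if (produced > dst_cap - total) {   /* dst would overflow */
            iconv_close(conv);
            return (size_t) -1;
        }
        memcpy(dst + total, chunk, produced);
        total += produced;
        /* in_pos and in_left already describe the unconverted remainder,
         * so the next iteration resumes exactly where this one stopped. */
    }

    iconv_close(conv);
    return total;
}
```

The loop never has to work out how far the conversion got: iconv() records that by advancing in_pos and decreasing in_left, which is exactly the bookkeeping that clobbers resbuff and outbuff when they are passed in directly.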

Additionally, iconv() returns either an error indicator or a count of irreversibly-converted characters, not a count of the total number of converted characters. For valid UTF-16{,LE,BE} to UTF-8, there should never be any irreversible conversions. A return value of zero indicates that the specified number of input bytes were all successfully and reversibly converted to output bytes.

Note also that resbuff, at least, never was a C string. The null chars embedded in the data make a string interpretation inappropriate. Depending on how your input and output buffers were initialized, however, it could be that after iconv() finishes, *resbuff == '\0' and *outbuff == '\0' (referring to your own current code). I'd call those "empty" strings, by the way, not "null" strings. If you do really mean that iconv() leaves resbuff == 0 and outbuff == 0 (i.e. NULL pointers) then that would constitute a bug in iconv().
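A small sketch of why a debugger or a string function can show such a buffer as a very short or empty string. The helper name visible_c_string_length is made up for this example; it simply reports where C string handling would stop reading within a byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): how many bytes string
 * functions would treat as "the string" if handed this buffer, i.e. the
 * offset of the first zero byte, or len if there is none. */
static size_t visible_c_string_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', len);
    return nul != NULL ? (size_t) (nul - buf) : len;
}
```

For the UTF-16LE bytes of "test" shown in the question this yields 1, since the second byte is already zero. A buffer whose first byte is zero yields 0: that is an empty string, which is still very different from a NULL pointer.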