Algorithm to convert UTF8 4 byte string to a UTF16 string in C


I need to convert a 4-byte UTF-8 sequence to a UTF-16 string in C.
I am not allowed to use any external libraries. I already have a macro defined that handles the 3-byte UTF-8 to UTF-16 conversion:

#define UTF8_3BYTE_TO_UCS16(char1, char2, char3) ((((char1) & 0x0F) << 12) | (((char2) & 0x3F) << 6) | ((char3) & 0x3F))

I am looking for a similar implementation for the 4-byte UTF-8 case as well.
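For reference, a quick sketch of that 3-byte macro in use (the byte values are the UTF-8 encoding of U+20AC EURO SIGN; the example is illustrative, not from the original question):

/* U+20AC (EURO SIGN) is E2 82 AC in UTF-8 */
uint16_t euro = UTF8_3BYTE_TO_UCS16(0xE2, 0x82, 0xAC); /* yields 0x20AC */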


There are 2 answers

Answer from str1ng:

UTF-8 encodes a Unicode character in 1 to 4 bytes. A 4-byte UTF-8 sequence has the following structure:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  

where each x represents a bit of the actual code point.
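For example, U+1F50C (ELECTRIC PLUG, the character used in the second answer below) carries its 21 payload bits split 3/6/6/6 across the four bytes:

0x1F50C = 000 011111 010100 001100   (21 bits)

byte 1: 11110 000 = 0xF0
byte 2: 10 011111 = 0x9F
byte 3: 10 010100 = 0x94
byte 4: 10 001100 = 0x8C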

A 4-byte UTF-8 sequence is translated into UTF-16 as a surrogate pair.

You can extract the code point from the UTF-8 sequence, then check whether it falls within the BMP (Basic Multilingual Plane). If it does, you can represent it with a single UTF-16 code unit; if not, you calculate the high and low surrogates:

#define UTF8_4BYTE_TO_UTF16(char1, char2, char3, char4) \
    uint32_t codePoint = ((((char1) & 0x07) << 18) | \
                          (((char2) & 0x3F) << 12) | \
                          (((char3) & 0x3F) << 6)  | \
                          ((char4) & 0x3F)); \
    uint16_t highSurrogate, lowSurrogate; \
    if (codePoint <= 0xFFFF) { \
        /* BMP character, can be represented directly in UTF-16 */ \
        highSurrogate = (uint16_t)codePoint; \
    } else { \
        /* Calculate surrogates for non-BMP character */ \
        codePoint -= 0x10000; \
        highSurrogate = (uint16_t)((codePoint >> 10) + 0xD800); \
        lowSurrogate = (uint16_t)((codePoint & 0x3FF) + 0xDC00); \
    }

Keep in mind that this macro declares intermediate 32-bit and 16-bit variables (codePoint, highSurrogate, lowSurrogate) in the scope where it is expanded, so you'll have to make sure those names don't clash with anything already declared in your function, or adjust the macro accordingly. You'll also need <stdint.h> for the fixed-width integer types.
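For instance, a minimal usage sketch (assuming the macro above; the byte values are the UTF-8 encoding of U+1F50C, the same character used in the answer below):

#include <stdio.h>
#include <stdint.h>

void demo(void)
{
    /* Expands into the declarations and if/else above, so codePoint,
       highSurrogate and lowSurrogate become visible in this scope. */
    UTF8_4BYTE_TO_UTF16(0xF0, 0x9F, 0x94, 0x8C);
    printf("%04X %04X\n", highSurrogate, lowSurrogate); /* prints D83D DD0C */
}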

UPDATE 1: Mark Tolonen opened my eyes; I didn't think this through. He is right that a 4-byte UTF-8 sequence always represents a code point beyond U+FFFF, so it always requires a surrogate pair in UTF-16. The BMP check in the code above is therefore unnecessary:

#define UTF8_4BYTE_TO_UTF16(char1, char2, char3, char4) \
    uint32_t codePoint = ((((char1) & 0x07) << 18) | \
                          (((char2) & 0x3F) << 12) | \
                          (((char3) & 0x3F) << 6)  | \
                          ((char4) & 0x3F)); \
    uint16_t highSurrogate, lowSurrogate; \
    codePoint -= 0x10000; \
    highSurrogate = (uint16_t)((codePoint >> 10) + 0xD800); \
    lowSurrogate = (uint16_t)((codePoint & 0x3FF) + 0xDC00);
Answer from Mark Tolonen:

Here are separate macros to generate the HI/LO surrogates. It would be better to use a function, so that errors can be returned for invalid byte sequences (see the sketch after the output below), or to use an existing conversion library such as ICU.

#include <stdio.h>
#include <stdint.h>

#define UTF8_4BYTE_TO_UNICODE(char1, char2, char3, char4) ((((char1) & 0x07) << 18) | (((char2) & 0x3F) << 12) | (((char3) & 0x3F) << 6) | ((char4) & 0x3F))
#define UNICODE_TO_UTF16_HI(uni) ((((uni) - 0x10000) >> 10) + 0xD800)
#define UNICODE_TO_UTF16_LO(uni) ((((uni) - 0x10000) & 0x3FF) + 0xDC00)

int main()
{
    // U+1F50C ELECTRIC PLUG 
    uint32_t uni = UTF8_4BYTE_TO_UNICODE(0xf0, 0x9f, 0x94, 0x8c);
    uint16_t hi = UNICODE_TO_UTF16_HI(uni);
    uint16_t lo = UNICODE_TO_UTF16_LO(uni);
    printf("%04X %04X\n", hi, lo);
    return 0;
}

Output:

D83D DD0C
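For completeness, here is a minimal sketch of the function-based approach suggested above; the function name and error convention are illustrative, not part of any existing API:

#include <stdint.h>

/* Converts a 4-byte UTF-8 sequence to a UTF-16 surrogate pair.
   Returns 0 on success, -1 on an invalid sequence. */
int utf8_4byte_to_utf16(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4,
                        uint16_t *hi, uint16_t *lo)
{
    /* Validate the lead byte (11110xxx) and the three
       continuation bytes (10xxxxxx). */
    if ((b1 & 0xF8) != 0xF0 || (b2 & 0xC0) != 0x80 ||
        (b3 & 0xC0) != 0x80 || (b4 & 0xC0) != 0x80)
        return -1;

    uint32_t cp = ((uint32_t)(b1 & 0x07) << 18) |
                  ((uint32_t)(b2 & 0x3F) << 12) |
                  ((uint32_t)(b3 & 0x3F) << 6)  |
                   (uint32_t)(b4 & 0x3F);

    /* Reject overlong encodings (below U+10000) and code points
       beyond the Unicode range (above U+10FFFF). */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 0;
}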
