I have a requirement to convert a UTF8 4 byte string to a UTF16 string in C.
I am not allowed to use any external libraries to support it. I already have a macro defined to support the UTF8 3 byte to UTF16 conversion:
#define UTF8_3BYTE_TO_UCS16(char1,char2,char3) ((((char1) & 0x0F) << 12) | (((char2) & 0x3F) << 6) | ((char3) & 0x3F))
I am looking for a similar implementation for the UTF8 4 byte as well.
UTF-8 encodes a Unicode character in 1 to 4 bytes. A 4-byte UTF-8 sequence has the following structure:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
where x represents a bit of the actual Unicode code point.
A 4-byte UTF-8 sequence is translated into UTF-16 as a pair of surrogate code units.
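To make the bit layout concrete, here is a minimal sketch that collects the 21 payload bits into a code point. The function name `utf8_4byte_to_codepoint` is hypothetical, and the sketch assumes the four bytes have already been checked to be a well-formed 4-byte sequence.

```c
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a 4-byte UTF-8 sequence.
   Assumes b1 is a 11110xxx lead byte and b2..b4 are 10xxxxxx continuation
   bytes; no validation is done here. */
static uint32_t utf8_4byte_to_codepoint(uint8_t b1, uint8_t b2,
                                        uint8_t b3, uint8_t b4)
{
    return ((uint32_t)(b1 & 0x07) << 18) |  /* 3 bits from the lead byte */
           ((uint32_t)(b2 & 0x3F) << 12) |  /* 6 bits from each          */
           ((uint32_t)(b3 & 0x3F) << 6)  |  /* continuation byte         */
            (uint32_t)(b4 & 0x3F);
}
```

For example, the bytes F0 9F 98 80 decode to U+1F600.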
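You can extract the Unicode code point from the UTF-8 sequence and then check whether it lies within the BMP (Basic Multilingual Plane, U+0000 to U+FFFF). If it does, you can represent it with a single UTF-16 code unit. If it does not, you calculate the high and low surrogates: subtract 0x10000 from the code point, add the upper 10 bits of the result to 0xD800 to get the high surrogate, and add the lower 10 bits to 0xDC00 to get the low surrogate.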
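The calculation described above can be sketched as a statement-like macro. This is only an illustration under stated assumptions: the macro name `UTF8_4BYTE_TO_UTF16` and its parameter names are hypothetical, the input bytes are assumed to be a structurally valid 4-byte sequence, and the caller is assumed to have declared a `uint32_t` for `cp` and two `uint16_t` variables for `hi` and `lo`.

```c
#include <stdint.h>

/* Hypothetical macro. The caller must declare:
     uint32_t cp;      receives the decoded code point
     uint16_t hi, lo;  receive the UTF-16 code unit(s)
   If the value fits in the BMP, hi holds the single code unit and lo is 0. */
#define UTF8_4BYTE_TO_UTF16(c1, c2, c3, c4, cp, hi, lo)                 \
    do {                                                                \
        (cp) = (((uint32_t)(c1) & 0x07) << 18) |                        \
               (((uint32_t)(c2) & 0x3F) << 12) |                        \
               (((uint32_t)(c3) & 0x3F) << 6)  |                        \
                ((uint32_t)(c4) & 0x3F);                                \
        if ((cp) <= 0xFFFF) {                                           \
            /* Within the BMP: one code unit is enough. */              \
            (hi) = (uint16_t)(cp);                                      \
            (lo) = 0;                                                   \
        } else {                                                        \
            /* Beyond the BMP: split (cp - 0x10000) into 10 + 10 bits. */ \
            (hi) = (uint16_t)(0xD800 | (((cp) - 0x10000) >> 10));       \
            (lo) = (uint16_t)(0xDC00 | (((cp) - 0x10000) & 0x3FF));     \
        }                                                               \
    } while (0)
```

With `cp`, `hi` and `lo` declared, `UTF8_4BYTE_TO_UTF16(0xF0, 0x9F, 0x98, 0x80, cp, hi, lo);` leaves `hi` equal to 0xD83D and `lo` equal to 0xDE00.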
Keep in mind that a macro for this uses intermediate 32-bit and 16-bit variables. Make sure they are declared properly in your function, or adjust the macro accordingly.
UPDATE 1: Mark Tolonen opened my eyes here; I hadn't thought this through. He is right that a 4-byte UTF-8 sequence always represents a code point beyond U+FFFF, so it always requires a surrogate pair in UTF-16. The BMP check in the earlier code was therefore not needed.
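Without the check, the conversion reduces to plain expression macros in the same style as the 3-byte macro from the question. The following is a minimal sketch, not a definitive implementation: the macro names are hypothetical, and it assumes the input is a well-formed 4-byte sequence (no overlong encodings, no values above U+10FFFF).

```c
#include <stdint.h>

/* Hypothetical macros; they assume validated 4-byte UTF-8 input. */

/* Collect the 21 payload bits into a code point (U+10000..U+10FFFF). */
#define UTF8_4BYTE_TO_CODEPOINT(c1, c2, c3, c4) \
    ((((uint32_t)(c1) & 0x07) << 18) |          \
     (((uint32_t)(c2) & 0x3F) << 12) |          \
     (((uint32_t)(c3) & 0x3F) << 6)  |          \
      ((uint32_t)(c4) & 0x3F))

/* High surrogate: 0xD800 plus the upper 10 bits of (cp - 0x10000). */
#define CODEPOINT_TO_HIGH_SURROGATE(cp) \
    ((uint16_t)(0xD800 | ((((uint32_t)(cp)) - 0x10000) >> 10)))

/* Low surrogate: 0xDC00 plus the lower 10 bits of (cp - 0x10000). */
#define CODEPOINT_TO_LOW_SURROGATE(cp) \
    ((uint16_t)(0xDC00 | ((((uint32_t)(cp)) - 0x10000) & 0x3FF)))
```

Decoding into a code point first and then deriving both surrogates from it keeps each macro a pure expression, so no intermediate variables have to exist in the calling function.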