6 byte emojis from NSUTF8StringEncoding


I am confused about the byte representation of an emoji encoded in UTF8. My understanding is that UTF8 characters are variable in size, up to 4 bytes.

When I encode the ❤️ emoji in UTF8 on iOS 13, I get back 6 bytes:

NSString* heartEmoji = @"❤️";
NSData* utf8 = [heartEmoji dataUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", utf8); // {length = 6, bytes = 0xe29da4efb88f}

If I reverse the operation, consuming only the first 3 bytes, I get a Unicode heart back:

uint8_t bytes[3] = { 0 };
[utf8 getBytes:bytes length:3];
NSString* decoded = [[NSString alloc] initWithBytes:bytes length:3 encoding:NSUTF8StringEncoding];
NSLog(@"%@", decoded); // ❤

Note that I use the heart as an example; I tried with many emoji and most are 4 bytes in UTF8, but some are 6.

Do I have some faulty assumptions about UTF8? What can I do to represent all emoji in 4 bytes as UTF8?

Rob Napier (accepted answer):

My understanding is that UTF8 characters are variable in size, up to 4 bytes.

This is not quite correct. A single Unicode code point takes at most 4 bytes in UTF-8. But a character (specifically an extended grapheme cluster) can be much longer, because it may be built from several combining code points. It can easily run to dozens of bytes, and there is no upper limit in the most extreme cases. See Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings? for an interesting example.

In your example, your emoji is HEAVY BLACK HEART (U+2764) followed by VARIATION SELECTOR-16 (U+FE0F), which indicates that it should be rendered as the red emoji rather than the text-style glyph. UTF-8 requires three bytes to encode each of those code points, giving the six bytes you observed.