The following is from Visual Studio's C# Interactive window:

> BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("😀"))
"D8-3D-DE-00"
> BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("🏴"))
"D8-3C-DF-F4"
> BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("😀🏴󠁧󠁢󠁥󠁮󠁧󠁿"))
"D8-3D-DE-00-D8-3C-DF-F4-DB-40-DC-67-DB-40-DC-62-DB-40-DC-65-DB-40-DC-6E-DB-40-DC-67-DB-40-DC-7F"

The smiley emoji's code units are a surrogate pair, as expected: "D8-3D-DE-00"

The flag emoji's code units are also a surrogate pair, as expected: "D8-3C-DF-F4"

Given that, shouldn't the code units of the smiley followed by the flag be "D8-3D-DE-00-D8-3C-DF-F4"?

There is 1 answer

JosefZ

The latter isn't a simple black flag emoji but an Emoji Tag Sequence:

Flag: England

Emoji Meaning: The flag for England, a country in the United Kingdom. May show as the letters gbeng.

The Flag: England emoji is a tag sequence combining Black Flag, Tag Latin Small Letter G, Tag Latin Small Letter B, Tag Latin Small Letter E, Tag Latin Small Letter N, Tag Latin Small Letter G and Cancel Tag. These display as a single emoji on supported platforms.

Flag: England was added to Emoji 5.0 in 2017.
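
To make the composition concrete, here is a minimal C# sketch that builds the sequence from its seven code points and dumps the UTF-16BE code units the same way the question does. The names `TagSequenceDemo`, `EnglandFlag`, `FromCodePoints` and `Utf16BeHex` are made up for illustration; only `char.ConvertFromUtf32`, `BitConverter.ToString` and `Encoding.BigEndianUnicode` come from .NET.

```csharp
using System;
using System.Linq;
using System.Text;

static class TagSequenceDemo
{
    // Code points of "Flag: England": WAVING BLACK FLAG,
    // tag letters g, b, e, n, g, then CANCEL TAG.
    public static readonly int[] EnglandFlag =
        { 0x1F3F4, 0xE0067, 0xE0062, 0xE0065, 0xE006E, 0xE0067, 0xE007F };

    // Concatenate the UTF-16 encoding of each code point into one string.
    public static string FromCodePoints(params int[] codePoints) =>
        string.Concat(codePoints.Select(char.ConvertFromUtf32));

    // Hex dump of the UTF-16BE bytes, as in the question.
    public static string Utf16BeHex(string s) =>
        BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s));
}
```

Calling `TagSequenceDemo.Utf16BeHex(TagSequenceDemo.FromCodePoints(TagSequenceDemo.EnglandFlag))` should give the 28 bytes that follow "D8-3D-DE-00" in the question's third result: one surrogate pair for the black flag plus six more pairs for the tag characters, since every code point above U+FFFF needs two UTF-16 code units.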

I previously wrote a PowerShell cmdlet, Get-CharInfo; here is its output for your string (the CodePoint column contains the Unicode code point (U+hhhh) and the UTF-8 bytes; the Description column includes the surrogate pair, if any):

 ""      | Get-CharInfo

Char CodePoint                      Category Description
---- ---------                      -------- -----------
   {U+1F600, 0xF0,0x9F,0x98,0x80} So       GRINNING FACE (0xd83d,0xde00)
   {U+1F3F4, 0xF0,0x9F,0x8F,0xB4} So       WAVING BLACK FLAG (0xd83c,0xdff4)
   {U+E0067, 0xF3,0xA0,0x81,0xA7} Cf       TAG LATIN SMALL LETTER G (0xdb40,0xdc67)
   {U+E0062, 0xF3,0xA0,0x81,0xA2} Cf       TAG LATIN SMALL LETTER B (0xdb40,0xdc62)
   {U+E0065, 0xF3,0xA0,0x81,0xA5} Cf       TAG LATIN SMALL LETTER E (0xdb40,0xdc65)
   {U+E006E, 0xF3,0xA0,0x81,0xAE} Cf       TAG LATIN SMALL LETTER N (0xdb40,0xdc6e)
   {U+E0067, 0xF3,0xA0,0x81,0xA7} Cf       TAG LATIN SMALL LETTER G (0xdb40,0xdc67)
   {U+E007F, 0xF3,0xA0,0x81,0xBF} Cf       CANCEL TAG (0xdb40,0xdc7f)
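
If you would rather stay in C#, a rough equivalent of the CodePoint and surrogate-pair columns can be sketched as follows. This is not how Get-CharInfo is implemented; `CodePointDump` and `Describe` are hypothetical names, and character names and categories are left out (the base class library has no lookup for Unicode character names).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class CodePointDump
{
    // Yields one line per code point: U+hhhh, UTF-8 bytes, UTF-16 code units.
    public static IEnumerable<string> Describe(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Throws ArgumentException if s contains an unpaired surrogate.
            int cp = char.ConvertToUtf32(s, i);
            bool isPair = char.IsHighSurrogate(s[i]);
            string text = char.ConvertFromUtf32(cp);

            string utf8 = string.Join(",",
                Encoding.UTF8.GetBytes(text).Select(b => "0x" + b.ToString("X2")));
            string units = string.Join(",",
                text.Select(c => "0x" + ((int)c).ToString("x4")));

            yield return $"U+{cp:X4} {utf8} ({units})";

            if (isPair) i++; // skip the low surrogate that was just consumed
        }
    }
}
```

For example, `CodePointDump.Describe("\U0001F600").Single()` returns `U+1F600 0xF0,0x9F,0x98,0x80 (0xd83d,0xde00)`, and running it over the whole string from the question yields eight lines, one per row of the table above.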