I have an HTML file which contains Chinese text. When I open the file in any web browser, some characters appear to be missing.
Here's an example copied from the browser window:
本函旨在邀請您參�� 定於
I know for a fact that all other characters seen here are correct aside from the missing ones (confirmed by a native Chinese speaker).
In the HTML header, I have a tag which signifies the file contains UTF-8 encoded characters:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
I've already tried some other charsets in this META tag, but so far it seems any encoding method I try aside from UTF-8 ends up looking worse.
I also considered the possibility that it is a font issue, so I installed 3 different traditional Chinese fonts on my system and forced Chrome to use them. None of them made any difference - missing characters were still present.
If I open the HTML file with Notepad++, here's what I can see:
https://i.stack.imgur.com/Ex3C1.png
If I select and copy-paste this text into regular MS Notepad, I get this:
本函旨在邀請您參劦nbsp;定於
So you can see here that the "xE5 x8A" visible in Notepad++ seems to have been replaced by 劦.
Is there any reason why the browser would be showing �� instead of 劦 in this scenario?
Look again at the HTML file.
I see the first 2 bytes of a character encoded in UTF-8, followed by &amp;nbsp; ... let's imagine there was originally a \xA0, and this was mutated to &amp;nbsp; when the file was created by applying global substitutions to the UTF-8-encoded data.
However, \xE5\x8A\xA0 decodes as UTF-8 to U+52A0, which is not the same as the alien character, U+52A6 ... not close enough to an answer.
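The byte arithmetic behind that guess can be checked with a short Python sketch. It rests on assumptions: that the original character was U+52A0, and that some tool replaced the byte \xA0 with the text &amp;nbsp; in the raw data. The helper lenient_decode3 is hypothetical and written only for this illustration. It combines three bytes as a 3-byte UTF-8 sequence without checking that the last two are real continuation bytes. Whether any particular editor decodes this way is not confirmed here.

```python
# Sketch only: the substitution step and the lenient decoder are assumptions.

good = "\u52a0".encode("utf-8")   # the character the answer supposes was intended
print(good.hex())                 # e58aa0

# Hypothetical global substitution applied to the raw bytes, not to decoded text.
broken = good.replace(b"\xa0", b"&nbsp;")
print(broken)

# A strict decoder cannot complete the 3-byte sequence, so it emits U+FFFD
# and then carries on with the ASCII text "&nbsp;".
decoded = broken.decode("utf-8", errors="replace")
print(ascii(decoded))


def lenient_decode3(seq: bytes) -> str:
    """Combine the first three bytes as a 3-byte UTF-8 sequence WITHOUT
    validating that bytes two and three are continuation bytes (10xxxxxx)."""
    b0, b1, b2 = seq[:3]
    code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    return chr(code_point)


# "&" is 0x26; its low six bits are 0x26, whereas 0xA0 contributes 0x20.
print(hex(ord(lenient_decode3(good))))    # 0x52a0
print(hex(ord(lenient_decode3(broken))))  # 0x52a6
```

Under these assumptions, a strict decoder yields a replacement character where the truncated sequence sits, while a decoder that swallows the "&" as a third byte lands on U+52A6, which would also account for the leftover "nbsp;" in the pasted text.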