The HTML 4.01 spec says of numeric character references:
Numeric character references specify the code position of a character in the document character set.
So if the document character set is Unicode (for example, the document is encoded as UTF-8), numeric references should specify Unicode code points.
The HTML5 spec says of hexadecimal character references:
The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).
No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.
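Read literally, then, resolving a hexadecimal reference should just be a matter of converting the hex digits to an integer and taking that Unicode code point. A minimal Python sketch of that interpretation (the helper name is mine, not from either spec):

```python
def hex_reference_to_char(ref: str) -> str:
    """Resolve a hexadecimal character reference (e.g. "&#x20AC;") by the
    plain reading of the HTML5 text: the digits name a Unicode code point."""
    digits = ref.removeprefix("&#x").removeprefix("&#X").removesuffix(";")
    return chr(int(digits, 16))

print(hex_reference_to_char("&#x20AC;"))          # '€' (U+20AC EURO SIGN)
print(hex(ord(hex_reference_to_char("&#x80;"))))  # '0x80' -- U+0080, a C1 control, not €
```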
But it seems that all modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252 code points rather than Unicode code points.

For example, &#x80; displays as €, but U+0080 isn't the Unicode code point for €; U+20AC is. U+0080 is instead defined in Unicode as the control character PAD. &#x20AC; also (correctly) displays as €.
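The same remapping can be reproduced outside a browser: Python's html.unescape is documented to follow the HTML5 character-reference rules, so it serves as a convenient stand-in here for what browsers do:

```python
import html

# Both references produce U+20AC EURO SIGN, even though the first one
# nominally names U+0080.
print(html.unescape("&#x80;"))              # '€'
print(html.unescape("&#x20AC;"))            # '€'
print(html.unescape("&#x80;") == "\u20ac")  # True
print(html.unescape("&#x80;") == "\u0080")  # False
```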
Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?
[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]
I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, in the steps for "consume a character reference", which define the mapping for these characters.
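For anyone else looking, the mapping that step defines is a small table that replaces these code points with the characters Windows-1252 assigns to the corresponding byte values. A sketch of just the 0x80-0x9F part of that behaviour (the names C1_REPLACEMENTS and resolve_numeric_reference are mine, and the spec's other special cases, such as U+0000, surrogates, and out-of-range values, are omitted):

```python
# Replacements the HTML5 tokenizer applies to numeric character references
# in the range 0x80-0x9F; the values are the Windows-1252 characters for
# those byte positions. Code points absent from the table (0x81, 0x8D,
# 0x8F, 0x90, 0x9D) are left as-is, though the spec still flags a parse error.
C1_REPLACEMENTS = {
    0x80: "\u20ac", 0x82: "\u201a", 0x83: "\u0192", 0x84: "\u201e",
    0x85: "\u2026", 0x86: "\u2020", 0x87: "\u2021", 0x88: "\u02c6",
    0x89: "\u2030", 0x8a: "\u0160", 0x8b: "\u2039", 0x8c: "\u0152",
    0x8e: "\u017d", 0x91: "\u2018", 0x92: "\u2019", 0x93: "\u201c",
    0x94: "\u201d", 0x95: "\u2022", 0x96: "\u2013", 0x97: "\u2014",
    0x98: "\u02dc", 0x99: "\u2122", 0x9a: "\u0161", 0x9b: "\u203a",
    0x9c: "\u0153", 0x9e: "\u017e", 0x9f: "\u0178",
}

def resolve_numeric_reference(codepoint: int) -> str:
    """Return the character an HTML5 parser produces for a numeric
    character reference with the given value (C1 remapping only;
    the spec's other special cases are not sketched here)."""
    return C1_REPLACEMENTS.get(codepoint, chr(codepoint))

print(resolve_numeric_reference(0x80))    # '€' (U+20AC), matching browser behaviour
print(resolve_numeric_reference(0x20AC))  # '€'
```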