Why do DocBook generated XHTML5 Section titles have ASCII #c2 characters in them?

I noticed my generated XHTML5 numbered section titles have a Â between the number and the title string. I thought this was a generation error. But no, the gentext file of my DocBook distribution, common/en.xml, actually specifies this.

Line 338 of common/en.xml:

<l:template name="section" text="%n. %t"/>

The dot and space following the %n are, when viewed in a hex editor, ASCII character codes C2 and A0, which are the Â and NBSP characters respectively. I can understand NBSP. But why the Â?

I understand I can change this in my customization layer. But the default seems odd.

I'm using docbook-xsl-ns-1.77.1.

There is 1 answer

Jeremy Griffith

That is because the encoding is UTF-8, which is the normal Unicode encoding for text these days. In UTF-8, any character above 0x7F is represented by a sequence of 2, 3, or 4 bytes depending on how many significant code bits it contains.
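As a quick check of the byte counts, here is a minimal Python sketch. The helper name `utf8_length` and the sample characters are chosen only for illustration; the only library call used is the built-in `str.encode`.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One illustrative character for each sequence length.
samples = {
    "A": utf8_length("A"),                    # U+0041, plain ASCII
    "NBSP": utf8_length("\u00a0"),            # U+00A0, non-breaking space
    "EURO": utf8_length("\u20ac"),            # U+20AC, euro sign
    "EMOJI": utf8_length("\U0001f600"),       # U+1F600, outside the BMP
}

# The non-breaking space from the gentext file, as raw UTF-8 bytes.
nbsp_bytes = "\u00a0".encode("utf-8")

for name, length in samples.items():
    print(f"{name}: {length} byte(s)")
print(nbsp_bytes.hex(" "))
```

The last line prints the two bytes of the non-breaking space in hex, which are the same C2 and A0 seen in the hex editor.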

The byte 0xC2 is one of the lead bytes that start a 2-byte sequence. In binary, it's 1100 0010. The two leading 1 bits (followed by a 0 bit) denote a 2-byte sequence, and the bottom five bits are the top five bits of the encoded character. The second byte, 0xA0, is 1010 0000. The single leading 1 bit (followed by a 0 bit) marks a continuation byte, and its bottom six bits are the bottom six bits of the encoded character.

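Putting the bottom five bits from the first byte (0 0010) together with the bottom six bits from the second (10 0000), we get 000 1010 0000, which is 0xA0, or U+00A0: indeed the non-breaking space. The Â shows up only when a tool displays the bytes one at a time in a single-byte encoding such as ISO-8859-1, where 0xC2 on its own is Â.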
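The bit manipulation can be checked with a short Python sketch. The variable names are illustrative only, and the ISO-8859-1 ("latin-1") decode at the end is an assumption about how a viewer might misread the two bytes, not something taken from the DocBook stylesheets.

```python
first, second = 0xC2, 0xA0

# A 2-byte lead byte has the form 110xxxxx; a continuation byte is 10xxxxxx.
assert first >> 5 == 0b110
assert second >> 6 == 0b10

high_bits = first & 0b00011111    # bottom five bits of the lead byte
low_bits = second & 0b00111111    # bottom six bits of the continuation byte
code_point = (high_bits << 6) | low_bits

decoded_manually = chr(code_point)
decoded_by_python = bytes([first, second]).decode("utf-8")

# Reading the same two bytes as a single-byte encoding yields two characters.
misread_as_latin1 = bytes([first, second]).decode("latin-1")

print(f"U+{code_point:04X}")
print(decoded_manually == decoded_by_python)
print(repr(misread_as_latin1))
```

Decoded as UTF-8, the pair is one character; decoded as ISO-8859-1, it becomes Â followed by a non-breaking space, which matches what the question describes.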