In my XML input file I have the following line:
<change beforeWhat="Literacy rate in L2: 50%–75%. Informally used" />
That character between 50% and 75% is not a hyphen but an en dash.
When I parse this XML file using expat in Python:
from xml.dom import minidom
postFixesDoc = minidom.parse('postFixes.xml')
I get the following error:
ExpatError: not well-formed (invalid token): line 35, column 99
where 35 is the line I quoted above from the XML input file, and 99 is the column of the % right before the en dash.
If I replace the en dash with the character reference &#x2013;, then the error goes away and everything works fine.
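For reference, here is a minimal sketch of why that workaround is equivalent. The parser resolves the numeric character reference itself, so the attribute value should end up containing the same U+2013 character even though the source bytes are pure ASCII. The element is a shortened, hypothetical stand-in for the line in my file, not the real input.

```python
from xml.dom import minidom

# Shortened stand-in for the line in postFixes.xml. The en dash is written
# as the character reference &#x2013;, so no non-ASCII bytes are involved.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<change beforeWhat="50%&#x2013;75%" />'
)

doc = minidom.parseString(xml_bytes)
value = doc.documentElement.getAttribute("beforeWhat")

# The parsed attribute value contains the en dash itself, not the reference.
print(repr(value))
```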
So I have a workaround. But I don't understand why this is happening.
What I've read about the problem -- e.g. Python’s minidom, xml, and illegal unicode characters -- tells me that some characters that are legal in UTF-8 aren't legal in XML, and points me to section 2.2 of the XML Spec on legal character ranges. But the definition for Char there includes the range #x20-#xD7FF, and #x2013 obviously falls within that range. So what's the problem?
FWIW, the XML input file begins with a UTF-8 declaration,
<?xml version="1.0" encoding="utf8"?>
and I used a hex editor to verify that the en dash is represented by the character sequence E2 80 93, which is the correct UTF-8 encoding for en dash. So why won't expat accept it? Is this a bug in expat?
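Both of those checks can also be reproduced in a few lines of Python, as a rough sanity check rather than anything authoritative. The variable names here are just illustrative.

```python
EN_DASH = "\u2013"

# Section 2.2 of the XML spec lists #x20-#xD7FF as one of the legal Char ranges.
in_char_range = 0x20 <= ord(EN_DASH) <= 0xD7FF

# Encode the character as UTF-8 and show the bytes as a hex editor would.
utf8_hex = " ".join(f"{byte:02X}" for byte in EN_DASH.encode("utf-8"))

print(in_char_range, utf8_hex)
```

So neither the character itself nor its byte encoding looks like the culprit.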
Aha...
This Python doc footnote, though it applies to a different situation, alerted me to the fact that my XML encoding declaration was wrong: it points out that "UTF-8" is valid in an XML declaration's encoding attribute, but "UTF8" is not.
For some reason I was under the impression that utf8 was acceptable too. But when I changed the encoding in the declaration to UTF-8, the error went away!
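Here is a small sketch for comparing the two spellings side by side. It uses a shortened, hypothetical stand-in for my element, and try_parse is just a helper name I made up. I would expect the UTF-8 spelling to parse cleanly. What happens with utf8 may depend on the Python and expat versions in use, so the script reports the outcome rather than assuming it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Element containing a literal en dash, encoded as real UTF-8 bytes.
BODY = '<change beforeWhat="50%\u201375%" />'.encode("utf-8")


def try_parse(encoding_name):
    """Parse BODY behind a declaration naming encoding_name.

    Returns the attribute value on success, or the error text on failure.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding_name}"?>'
    try:
        doc = minidom.parseString(declaration.encode("ascii") + BODY)
    except (ExpatError, ValueError, LookupError) as exc:
        return f"error: {exc}"
    return doc.documentElement.getAttribute("beforeWhat")


results = {name: try_parse(name) for name in ("UTF-8", "utf8")}
for name, outcome in results.items():
    print(name, "->", repr(outcome))
```

Note that the input is passed as bytes on purpose, so that the parser has to rely on the encoding named in the declaration.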