Why does expat reject en dash character as invalid?

In my XML input file I have the following line:

<change beforeWhat="Literacy rate in L2: 50%–75%. Informally used" />

That character between 50% and 75% is not a hyphen but an en dash.

When I parse this XML file using expat in Python:

from xml.dom import minidom
postFixesDoc = minidom.parse('postFixes.xml')

I get the following error:

ExpatError: not well-formed (invalid token): line 35, column 99

where 35 is the line I quoted above from the XML input file, and 99 is the column of the % right before the en dash.

If I replace the en dash with &#x2013;, then the error goes away and everything works fine. So I have a workaround. But I don't understand why this is happening.

What I've read about the problem -- e.g. "Python’s minidom, xml, and illegal unicode characters" -- tells me that some characters that are legal in UTF-8 aren't legal in XML, and points me to section 2.2 of the XML Spec on legal character ranges. But the definition of Char there includes the range #x20-#xD7FF, and #x2013 obviously falls within that range. So what's the problem?

FWIW, the XML input file begins with a UTF-8 declaration,

<?xml version="1.0" encoding="utf8"?>

and I used a hex editor to verify that the en dash is represented by the byte sequence E2 80 93, which is the correct UTF-8 encoding for en dash. So why won't expat accept it? Is this a bug in expat?

There are 2 answers

LarsH (best answer)

Aha...

This Python doc footnote, though it applies to a different situation, alerted me to the fact that my XML encoding declaration was wrong:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.

For some reason I was under the impression that utf8 was acceptable too. But when I changed the declaration to

<?xml version="1.0" encoding="utf-8"?>

the error went away!
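For anyone who wants to check this on their own setup, here is a minimal sketch of the working case. The helper name parse_with_label is hypothetical (it is not part of minidom), and the one-element document is just the line from the question, built in memory instead of read from postFixes.xml.

```python
from xml.dom import minidom

def parse_with_label(label):
    """Parse a tiny UTF-8 document whose XML declaration uses `label`.

    Hypothetical helper for illustration; returns the attribute value.
    """
    text = (
        f'<?xml version="1.0" encoding="{label}"?>\n'
        '<change beforeWhat="Literacy rate in L2: 50%\u201375%." />'
    )
    # The bytes handed to the parser are UTF-8 regardless of the label.
    doc = minidom.parseString(text.encode("utf-8"))
    return doc.documentElement.getAttribute("beforeWhat")

value = parse_with_label("utf-8")
print(ascii(value))  # ascii() shows the en dash as \u2013 on any terminal
```

Calling parse_with_label("utf8") is how I would try to reproduce the original error. What exactly gets raised may depend on the Python and expat versions, so I have not shown an expected result for that case.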

TextGeek

Glad fixing the encoding helped! In general, a useful trick with encoding issues is to convert all non-ASCII characters to numeric character references (like the "&#x2013;" you tried). If that fixes it, then the problem is almost certainly at the encoding level, at which point you start figuring out whether your data is really in UCS-2, UTF-8, or CP1252. (CP1252 is a common culprit with curly quotes and em/en dashes, though happily you didn't get bitten by that one.)
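In Python, one way to do that conversion is the standard "xmlcharrefreplace" error handler. This is only a sketch: to_char_refs is a made-up name, and it assumes you already have the text as a correctly decoded str.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" is a built-in codec error handler; it emits
    # decimal references, so the en dash (U+2013) becomes &#8211;.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

escaped = to_char_refs("Literacy rate in L2: 50%\u201375%.")
print(escaped)  # Literacy rate in L2: 50%&#8211;75%.
```

Note that this only touches non-ASCII characters. It does not escape & or <, so it is a diagnostic aid rather than a general XML escaper.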

The *nix "iconv" utility can translate between zillions of character encodings. If you ask it to translate your data from (say) utf8 to ucs2, it will scream about any invalid byte sequences.
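If iconv isn't handy, a rough equivalent of that check can be sketched in Python. The function name first_bad_utf8_offset is my own invention, and the sample bytes are made up to show the CP1252 case mentioned above, where the en dash is the single byte 0x96.

```python
def first_bad_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

good = "50%\u201375%".encode("utf-8")    # en dash as E2 80 93
bad = "50%\u201375%".encode("cp1252")    # en dash as the single byte 0x96

print(first_bad_utf8_offset(good))  # None
print(first_bad_utf8_offset(bad))   # 3
```

This only says whether the bytes are valid UTF-8. It cannot tell you what the encoding actually is, and it says nothing about what the XML declaration claims.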

XML adds one more complication: most control characters (decimal 0 through 31, other than CR, LF, and HT) are not allowed in XML 1.0 at all, not even as numeric character references. But an XML parser worth its salt will tell you if it sees those.
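If you would rather find those characters before the parser does, a small scan along these lines should work. It is a sketch assuming XML 1.0 rules and already-decoded text; FORBIDDEN_CONTROLS and find_forbidden_controls are illustrative names, not library APIs.

```python
import re

# C0 control characters that XML 1.0 forbids: everything below U+0020
# except tab (U+0009), line feed (U+000A), and carriage return (U+000D).
FORBIDDEN_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_forbidden_controls(text):
    """Return (offset, code point) pairs for each forbidden control character."""
    return [(m.start(), ord(m.group())) for m in FORBIDDEN_CONTROLS.finditer(text)]

sample = "ok\tline\n" + "bad\x0bvalue\x00"
print(find_forbidden_controls(sample))  # [(11, 11), (17, 0)]
```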