I am struggling with the following issue: I have an XML string that contains the following tag, and I want to parse it with cElementTree into a valid XML document:

<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;
#57154;#55296;#57136;</tag>

In the actual string, each # sign is preceded by an & sign, so the output looks like: ��������������

This is a Unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal in a valid XML document (see the question Parser error using Perl XML::DOM module, "reference to invalid character number").
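To make the "not legal" claim concrete, here is a minimal sketch of the character ranges that the XML 1.0 Char production allows. The helper name is_valid_xml_char is hypothetical, not part of any library. The values in the tag (55296 = 0xD800, 57136 = 0xDF30, and so on) fall in the UTF-16 surrogate range 0xD800-0xDFFF, which XML excludes.

```python
def is_valid_xml_char(code_point: int) -> bool:
    """Return True if code_point is allowed by the XML 1.0 Char production.

    Hypothetical helper for illustration; the ranges come from the XML 1.0 spec.
    """
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF        # BMP below the surrogate block
        or 0xE000 <= code_point <= 0xFFFD      # BMP above the surrogate block
        or 0x10000 <= code_point <= 0x10FFFF   # supplementary planes
    )


# 55296 and 57136 are UTF-16 surrogates, so neither is a legal XML character.
print(is_valid_xml_char(55296))
print(is_valid_xml_char(57136))
# An ordinary character such as "A" (65) is legal.
print(is_valid_xml_char(65))
```

Each pair of references in the tag looks like a UTF-16 surrogate pair written as two separate references, which is why a conforming parser rejects them one by one.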

I have tried several regular expressions to match these numeric character references. For example, I have tried the following (Python) regex:

RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
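One thing worth noting about this pattern: the square brackets turn \d{1,5} into a character class, so it matches any run of digits, braces, commas and the characters 1 and 5, not "one to five digits". The sketch below compares it with a tighter alternative; the name tighter and the sample strings are made up for illustration.

```python
import re

# The pattern from the question, written as a raw string.
original = re.compile(r'&#[\d{1,5}]+;')
# Hypothetical alternative: decimal digits only, captured as a group.
tighter = re.compile(r'&#(\d+);')

sample = '&#55296;&#57136;'

# Both patterns find the two references in a string that still contains them.
print(original.findall(sample))
print(tighter.findall(sample))

# Only the original pattern accepts this nonsense input, because
# "{", "," and "}" are members of its character class.
print(bool(original.fullmatch('&#{,};')))
print(bool(tighter.fullmatch('&#{,};')))
```

Either pattern only matches while the text still contains the literal &#...; sequences, that is, before anything has decoded them.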

This works in a regular Python session, but as soon as I use the same regex in my code it no longer matches, presumably because the numeric character references have already been interpreted (and are shown as boxes or question marks).

I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.

Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?
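For illustration, here is a rough sketch of one way this could be approached, assuming the raw text is still available with the literal &#...; sequences in it (before any parser or unescape step has touched it). The names NUMERIC_REF, is_valid_xml_char and strip_invalid_refs are hypothetical, and xml.etree.ElementTree stands in for cElementTree, which exposes the same fromstring function.

```python
import re
import xml.etree.ElementTree as ET  # same API as cElementTree for this purpose

# Decimal (&#1234;) or hexadecimal (&#x4D2;) numeric character references.
NUMERIC_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|(\d+));')


def is_valid_xml_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def strip_invalid_refs(raw_xml):
    """Drop numeric references to illegal characters; keep all other text."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep legal references untouched, remove illegal ones entirely.
        return match.group(0) if is_valid_xml_char(code_point) else ''
    return NUMERIC_REF.sub(replace, raw_xml)


# Shortened, made-up input: two surrogate references plus one legal reference.
raw = '<tag>a&#55296;&#57136;b&#65;</tag>'
cleaned = strip_invalid_refs(raw)
root = ET.fromstring(cleaned)
print(cleaned)
print(root.text)
```

This discards the references, as asked. If the pairs are meant to be real characters, combining each high/low surrogate pair into a single code point would be an alternative, but that is a different design choice and not shown here.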
