How to parse XML with multiple character entities?


I have an XML file that states it's using utf-8. When I open the file in VIM, I see something like

<?xml version="1.0" encoding="UTF-8"?> 
<r>
  <first-tag>foo</first-tag>
  <second-tag>
     &lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;
  </second-tag>
  ...
</r>

I'm using the SAXParser from Java 1.6.0_41. While consuming this data, the parser doesn't recognize the escaped literals as markup: it either skips over them or treats the escaped characters as "content" data for second-tag.

Here's how I'm consuming the data:

File f = ...
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
// Declare the stream's type so the snippet compiles (java.io.InputStream)
InputStream stream = new FileInputStream(f);
AbstractHandler handler = ...
parser.parse(new InputSource(stream), handler);

Is there a way for SAX to treat the nested escaped XML data as truly XML markup and not merely data as-is for second-tag?

kjhughes On BEST ANSWER

UTF-8 is a character encoding. It wouldn't make sense to have multiple character encodings in a single file, nor do you show any evidence of having multiple character encodings.

What you do show are multiple character entity references such as &lt; and &gt;. These are not a problem, although they may indicate (intentional or accidental) escaped output of XML markup.

What is a problem is that your "XML" lacks a single root element and is therefore not well-formed.

If you give your markup a single root element,

<?xml version="1.0" encoding="UTF-8"?>
<r>
  <first-tag>foo</first-tag>
  <second-tag>
    &lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;
  </second-tag>
</r>

an XML parser will be able to parse it just fine.
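As a rough check of this, the sketch below parses the sample with the JDK's DOM parser and looks at what second-tag actually contains. The class and method names (`WellFormedCheck`, `secondTag`, `childElementCount`) are made up for illustration, and the document is supplied as a string rather than read from your file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answer.
class WellFormedCheck {

    // Parse the document and return the first <second-tag> element.
    static Element secondTag(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName("second-tag").item(0);
    }

    // Count only child nodes that are elements (text nodes are ignored).
    static int childElementCount(Element e) {
        int n = 0;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<r><first-tag>foo</first-tag><second-tag>"
                + "&lt;a-tag-nested-in-second-tag&gt;some data"
                + "&lt;/a-tag-nested-in-second-tag&gt;"
                + "</second-tag></r>";
        Element e = secondTag(xml);
        // The escaped markup is plain text, so there are no child elements.
        System.out.println(childElementCount(e));
        // The parser has already resolved &lt; and &gt; in the text content.
        System.out.println(e.getTextContent());
    }
}
```

The parse succeeds, second-tag has no child elements, and its text content is the unescaped string. That is the behavior the XML specification requires, not a parser defect.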


Update per comments and updated question

Is there a way for SAX to treat the nested escaped xml data as truly xml markup and not merely data as-is for "second-tag"?

No, there's not a simple configuration flag that'll direct SAX to treat escaped XML as regular XML. SAX will rightly see the escaped XML data as the characters and character entity references that it is. Your options are

  1. fixing the problem upstream by eliminating the escaping of the XML you wish to preserve, or
  2. post-processing the escaped XML data to re-establish the original XML.

Note that option #2 might itself involve a SAX-based parser whose handlers you've designed to rebuild the original XML.
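One way option #2 could look is a two-pass parse, sketched below on the assumption that the escaped content of second-tag is itself a well-formed fragment with a single root. The first pass collects the character data that SAX reports for second-tag (with `&lt;` and `&gt;` already resolved), and the second pass feeds that text to a fresh parser. `EscapedXmlSketch`, `textOf`, and `elementNames` are hypothetical names, and the code avoids anything newer than Java 6.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a two-pass approach; not a drop-in solution.
class EscapedXmlSketch {

    // Pass 1: concatenate the character data found inside elements named `tag`.
    // characters() may be called several times per element, so we append.
    static String textOf(String xml, final String tag) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    inside = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals(tag)) {
                    inside = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) {
                    text.append(ch, start, length);
                }
            }
        };
        newParser().parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    // Pass 2: parse the recovered text as XML and record each element name seen.
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<String>();
        newParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        return names;
    }

    private static SAXParser newParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        String outer = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<r>\n  <first-tag>foo</first-tag>\n  <second-tag>\n"
                + "    &lt;a-tag-nested-in-second-tag&gt;some data"
                + "&lt;/a-tag-nested-in-second-tag&gt;\n"
                + "  </second-tag>\n</r>";
        // Trim the surrounding whitespace before reparsing the fragment.
        String inner = textOf(outer, "second-tag").trim();
        System.out.println(inner);
        System.out.println(elementNames(inner));
    }
}
```

If the escaped content can hold several top-level elements, the recovered text would need to be wrapped in a synthetic root element before the second pass. This approach also cannot tell deliberately escaped text from escaped markup, which is one reason fixing the escaping upstream is usually the cleaner option.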

See also: how to unescape XML in Java.