I have an XML file that states it's using UTF-8. When I open the file in Vim, I see something like:
<?xml version="1.0" encoding="UTF-8"?>
<r>
<first-tag>foo</first-tag>
<second-tag>
&lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;
</second-tag>
...
</r>
I'm using Java 1.6.0_41's SAXParser, and while consuming this data, the parser doesn't see the escaped literals as markup at all; it either skips over them or treats the escaped characters as "content" data for second-tag.
Here's how I'm consuming the data:
File f = ...
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
InputStream stream = new FileInputStream(f);
AbstractHandler handler = ... // extends org.xml.sax.helpers.DefaultHandler
parser.parse(new InputSource(stream), handler);
Is there a way for SAX to treat the nested escaped XML data as true XML markup, rather than as mere character data for second-tag?
UTF-8 is a character encoding. It wouldn't make sense to have multiple character encodings in a single file, nor do you show any evidence of having multiple character encodings.
What you do show are multiple character entity references such as &lt; and &gt;. These are not a problem, although they may indicate (intentional or accidental) escaped output of XML markup. What was a problem is that the markup in the original version of your question lacked a single root element and was therefore not well-formed.
If you give your markup a single root element, an XML parser will be able to parse it just fine.
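For illustration, here is a minimal, self-contained sketch (the class name EscapedXmlDemo and the sample document are made up) showing that a well-formed document containing escaped markup parses without complaint; the entities simply arrive as character data:

import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EscapedXmlDemo {
    public static void main(String[] args) throws Exception {
        // A well-formed document whose second-tag contains *escaped* markup.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<r><second-tag>&lt;nested&gt;some data&lt;/nested&gt;</second-tag></r>";

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Typically prints "<nested>some data</nested>": the entities
                // have been decoded, but only into plain character data.
                System.out.print(new String(ch, start, length));
            }
        });
    }
}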
Update per comments and updated question
No, there's not a simple configuration flag that'll direct SAX to treat escaped XML as regular XML. SAX will rightly see the escaped XML data as the characters and character entity references that it is. Your options include
1. fixing the problem upstream by not escaping the nested markup in the first place, or
2. extracting the escaped content as character data and parsing it as XML in a second pass.
Note that option #2 might itself involve a SAX-based parser whose entity handlers you've designed to rebuild the original XML.
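As a rough sketch of option #2 (the class name SecondTagHandler and the innerHandler parameter are hypothetical), note that SAX has already decoded &lt; back into < by the time characters() runs, so the buffered text is real markup that can be fed through a parser a second time:

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text content of second-tag, then parses that
// text as XML in a second pass using a handler you supply.
class SecondTagHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final DefaultHandler innerHandler; // your handler for the nested markup
    private boolean inSecondTag;

    SecondTagHandler(DefaultHandler innerHandler) {
        this.innerHandler = innerHandler;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("second-tag".equals(qName)) {
            inSecondTag = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inSecondTag) {
            buffer.append(ch, start, length); // already-decoded text
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("second-tag".equals(qName)) {
            inSecondTag = false;
            try {
                // Second pass: the buffered text is now plain XML markup.
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(buffer.toString())),
                        innerHandler);
            } catch (Exception e) {
                throw new SAXException(e);
            }
        }
    }
}

If the escaped content can contain more than one top-level element, you would need to wrap the buffered text in a synthetic root element before the second parse.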
See also: How to unescape XML in Java.
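And if you instead end up with the escaped string in hand (for example, by reading the file as plain text), a minimal unescape for the five predefined XML entities might look like the sketch below; for anything beyond that (numeric character references, double escaping), a library method such as Apache Commons Text's StringEscapeUtils.unescapeXml is the usual route.

// Minimal sketch: decodes only XML's five predefined entities.
// &amp; must be replaced last, so that "&amp;lt;" decodes to "&lt;"
// rather than collapsing two levels of escaping into "<".
static String unescapeXml(String s) {
    return s.replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&apos;", "'")
            .replace("&amp;", "&");
}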