Expat unable to parse when '<' or '>' found in text section


Let's say I have something like <data>some 'text'</data>. Expat has no problem parsing this.
Now if I have <data>'<some text>'</data>, it freaks out about a mismatched tag, because it treats the < as the start of a new tag.

Unfortunately I can't just escape every < and >, because then the document is no longer well-formed: the real start tags would be escaped too. Is there a simple way around this? The only approach I can think of is a regular expression that escapes < and > when they are found within quotes.

EDIT: The actual portion that breaks it:

<script type='text/javascript'>
(function() {
var useSSL = 'https:' == document.location.protocol;
var src = (useSSL ? 'https:' : 'http:') +
'//www.googletagservices.com/tag/js/gpt.js';
document.write('<scr' + 'ipt src="' + src + '"></scr' + 'ipt>');
})();
</script>


Answer by abarnert (best answer)

Assuming your bad (X)HTML is all consistent with this example, the rule seems pretty obvious: you want to treat the contents of script tags as if they were CDATA. The markup isn't valid as it stands, but that rule gives you something relatively simple that you can write and apply to your page before parsing it. You could either wrap the script body in a CDATA section, escape the angle brackets within the script body, or do whatever else you find appropriate. Then you'll have valid markup (or maybe you'll just have the next error to deal with) that you can successfully parse. (Without knowing what you're trying to do with the data beyond parsing, nobody can suggest anything much more specific.)
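
As a rough sketch of the CDATA option, assuming every script element in your pages looks like the one in the question (a plain opening tag, a body that never contains a literal closing script tag, and no CDATA section already present), something like this could run before handing the page to expat. The names SCRIPT_RE, cdata_wrap_scripts and parse_text are made up for illustration:

```python
import re
import xml.parsers.expat

# Hypothetical pattern: opening <script ...> tag, body, closing </script> tag.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script\s*>)",
                       re.DOTALL | re.IGNORECASE)

def cdata_wrap_scripts(markup):
    """Wrap each script body in a CDATA section so an XML parser skips over it."""
    def wrap(match):
        open_tag, body, close_tag = match.groups()
        if "<![CDATA[" in body:
            # Assume the author already protected this script; leave it alone.
            return match.group(0)
        # "]]>" would end the CDATA section early, so split it across two sections.
        body = body.replace("]]>", "]]]]><![CDATA[>")
        return f"{open_tag}<![CDATA[{body}]]>{close_tag}"
    return SCRIPT_RE.sub(wrap, markup)

def parse_text(markup):
    """Parse with expat and return all character data joined together."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(markup, True)
    return "".join(chunks)
```

Calling parse_text(cdata_wrap_scripts(page)) should then get past the document.write line. Note that this uses a regex only to find the script elements, not to decide what is "inside a quote", and it will still fail on pages that break the assumptions above.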


The rule you suggested, "making a regular expression to escape < and > if they are found within a quote", is clearly not going to work. Consider how this would affect these two fragments:

<div id='normal'>Here is some '<div id='quoted'>quoted</div>' text</div>
<div id='normal'>Here's some '<div id='quoted'>quoted</div>' text</div>

And that's besides the issue that, even if the language you're suggesting weren't ambiguous, it still wouldn't be a regular language, so a regular expression is the wrong tool for it.
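
To make the problem with those two fragments concrete, here is one naive reading of the proposed rule, purely as an illustration (QUOTED_RE and escape_in_quotes are hypothetical names, and the asker may have had a different regex in mind):

```python
import re

# Naive idea: anything between two single quotes counts as "quoted".
QUOTED_RE = re.compile(r"'[^']*'")

def escape_in_quotes(markup):
    """Escape < and > inside every '...' span found by QUOTED_RE."""
    def esc(match):
        return match.group(0).replace("<", "&lt;").replace(">", "&gt;")
    return QUOTED_RE.sub(esc, markup)

first = "<div id='normal'>Here is some '<div id='quoted'>quoted</div>' text</div>"
second = "<div id='normal'>Here's some '<div id='quoted'>quoted</div>' text</div>"

print(escape_in_quotes(first))
print(escape_in_quotes(second))
```

With this particular regex the inner div in the first fragment gets escaped, while the second fragment comes back unchanged, because the apostrophe in "Here's" shifts which quotes pair up. The two fragments differ by a single character, yet they get opposite treatment.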


Also, it's worth asking whether this is actually XML in the first place. If it's XHTML, it's got additional problems; for example, document.write does not exist in the XHTML DOM. It might be the XML serialization of HTML5, but it might just be HTML5 or HTML 4.01, in which case you shouldn't be trying to parse it as XML at all.


You may also want to consider using a more liberal parser. Trying beautifulsoup4 with each of the parsers it knows how to use (lxml in XML mode and in HTML mode, html.parser, and html5lib) until you find one that works consistently can be a good quick-and-dirty solution to broken markup.
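
For a quick feel of what a more liberal parser does with the snippet from the question, here is a minimal sketch using only the standard library's html.parser (one of the parsers listed above), without beautifulsoup4 on top. ScriptCollector and script_bodies are hypothetical names for this example:

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the raw text of every script element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several pieces, so append to the current body.
        if self.in_script:
            self.scripts[-1] += data

def script_bodies(markup):
    collector = ScriptCollector()
    collector.feed(markup)
    collector.close()
    return collector.scripts
```

Because html.parser treats script content as raw text rather than markup, the '<scr' + 'ipt ...' pieces come through as ordinary data instead of triggering an error. Whether it copes with the rest of your pages is something you'd have to try.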