Let's say I have something like this: <data>some 'text'</data>. expat has no problem parsing this.
Now if I have this: <data>'<some text>'</data>, it freaks out about a mismatched tag, which is due to the < being found.
Unfortunately I can't just escape all < and > because that will result in a "not well-formed" error, since there is no longer a start tag. Is there a simple way to get around this? The only way I can think of is making a regular expression to escape < and > if they are found within a quote.
EDIT: The actual portion that breaks it:
<script type='text/javascript'>
(function() {
    var useSSL = 'https:' == document.location.protocol;
    var src = (useSSL ? 'https:' : 'http:') +
        '//www.googletagservices.com/tag/js/gpt.js';
    document.write('<scr' + 'ipt src="' + src + '"></scr' + 'ipt>');
})();
</script>
Assuming your bad (X)HTML is all consistent with this example, the rule seems pretty obvious: you want to treat script tags as if they were CDATA. That isn't valid, but it gives you something relatively simple that you can write and apply to your page before parsing it. You could either CDATA-fy the script body, quote angle brackets within the script body, or whatever else you find appropriate. Then you'll have valid markup (or maybe you'll just have the next error to deal with) that you can successfully parse. (Without knowing what you're trying to do with the data beyond parsing, most likely nobody can suggest anything much more specific.)

The rule you suggested, "making a regular expression to escape < and > if they are found within a quote", is clearly not going to work. Consider how this would affect these two fragments:

And that's even besides the issue that, even if the language you're suggesting were not ambiguous, it still wouldn't be a regular language.
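The preprocessing idea above (treating script bodies as CDATA before handing the page to expat) could look roughly like the following. This is only a sketch under assumptions: the helper names cdata_wrap_scripts and collect_text are made up for illustration, the sample page wrapped around the script is invented, and the regular expression assumes script elements are never nested and never contain the literal text "</script" in their body (the sample avoids it with '</scr' + 'ipt>').

```python
import re
import xml.parsers.expat

# Captures the opening <script ...> tag, the body, and the closing tag.
# Assumption: no nested scripts and no literal "</script" inside the body.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)", re.DOTALL | re.IGNORECASE
)


def cdata_wrap_scripts(markup):
    """Wrap each script body in a CDATA section so XML parsers see plain text."""
    def wrap(match):
        open_tag, body, close_tag = match.groups()
        # A literal "]]>" would close the CDATA section early, so split it
        # across two adjacent CDATA sections.
        body = body.replace("]]>", "]]]]><![CDATA[>")
        return f"{open_tag}<![CDATA[{body}]]>{close_tag}"

    return SCRIPT_RE.sub(wrap, markup)


def collect_text(markup):
    """Parse with expat and return all character data as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(markup, True)  # raises ExpatError on malformed input
    return "".join(chunks)


# Hypothetical page built around the script from the question.
page = """<html><head><script type='text/javascript'>
(function() {
var useSSL = 'https:' == document.location.protocol;
var src = (useSSL ? 'https:' : 'http:') +
'//www.googletagservices.com/tag/js/gpt.js';
document.write('<scr' + 'ipt src="' + src + '"></scr' + 'ipt>');
})();
</script></head><body><p>hello</p></body></html>"""

fixed = cdata_wrap_scripts(page)
text = collect_text(fixed)  # the script source now arrives as character data
```

Calling collect_text(page) on the unmodified page raises xml.parsers.expat.ExpatError, while the wrapped version parses. If the rest of the page has other well-formedness problems, this only moves you on to the next error.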
Also, it's worth asking whether this is actually XML in the first place. If it's XHTML, it's got additional problems, e.g., document.write does not exist in the XHTML DOM. It might be the XML serialization profile for HTML5, but it might just be HTML5 or HTML 4.01, in which case you shouldn't be trying to parse it as XML in the first place.

You may also want to consider using a more liberal parser. Trying beautifulsoup4 with each of the parsers it knows how to use (lxml in XML mode, HTML mode, and HTML5 mode, and html.parser, and html5lib) until you find one that works consistently can be a good quick-and-dirty solution to broken markup.
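To illustrate the liberal-parser route without installing anything, here is a minimal sketch that uses only the standard library's html.parser, one of the backends beautifulsoup4 can use. The ScriptCollector class and the sample markup are made up for illustration. The point is that an HTML parser treats the body of a script element as raw text, so the angle brackets inside the JavaScript strings do not get parsed as tags.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every script element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Hypothetical fragment containing the line that breaks expat.
markup = """<div><script type='text/javascript'>
document.write('<scr' + 'ipt src="' + src + '"></scr' + 'ipt>');
</script><p>after</p></div>"""

collector = ScriptCollector()
collector.feed(markup)
collector.close()
# collector.scripts now holds one string: the untouched JavaScript source.
```

With beautifulsoup4 installed, BeautifulSoup(markup, "html.parser") selects this same backend, and swapping the second argument for "lxml" or "html5lib" is how you would try the other parsers.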