What I need is a way to use the html5lib parser to generate a real xml.etree.ElementTree. (lxml is not an option for portability reasons.)
ElementTree.parse can take a parser as an optional parameter:
xml.etree.ElementTree.parse(source, parser=None)
but it's not clear what such a parser would look like. Is there a class or object within html5lib I could use for the parser argument? Documentation for both libraries on this issue is thin.
Context:
I have a malformed XHTML file that can't be parsed with ElementTree.parse:
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Title</title></head>
<body><div class="cls">Note that this br<br>is missing a closing slash</div></body>
</html>
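For reference, the failure can be reproduced with a cut-down version of that file. This is only a minimal sketch: the xml_text string is a shortened stand-in for the real document, and it uses fromstring rather than parse so that no file on disk is needed.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the file above, keeping the unclosed <br>.
xml_text = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="cls">Note that this br<br>'
    'is missing a closing slash</div></body></html>'
)

try:
    ET.fromstring(xml_text)
    parsed = True
except ET.ParseError as exc:
    # The strict XML parser rejects the document because <br> is never closed.
    parsed = False
    error = exc
```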
So I used html5lib.parse instead, with the default treebuilder="etree" parameter, which worked fine.
But html5lib apparently does not output an xml.etree.ElementTree object, just one with a near-identical API. There are two problems with this:
- html5lib's find does not support the namespaces parameter, making XPath excessively verbose without a clumsy wrapper function.
- The Eclipse debugger does not support drill-through of html5lib etrees.
So I cannot use either ElementTree or html5lib alone.
Given xml.etree.ElementTree as etree (as it is commonly imported):
What's returned is not an etree.ElementTree, but rather an etree.Element (this is the same as what etree.fromstring returns; only etree.parse returns an etree.ElementTree). It is genuinely part of the etree module; it's not something with a similar API. The problem you've run into applies to etree.fromstring as much as it does to html5lib.
The Python documentation for xml.etree.ElementTree doesn't mention the namespaces argument; it seems to be an undocumented feature of ElementTree objects (but not Element objects). As such, it's probably not something that should really be relied on! Your best bet is likely going to be to use a wrapper function.
The fact that Eclipse cannot go through the trees is down to the fact that html5lib defaults to xml.etree.cElementTree when it exists. That module is meant to be identical, per its documentation, but it is implemented in C using CPython's API, which stops Eclipse's debugger from functioning. You can get a treebuilder using the non-accelerated version (note that from Python 3.3 both are the C implementation; cElementTree merely survives as a deprecated alias) using the below:
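A minimal sketch, assuming html5lib's getTreeBuilder takes the module to build with through its implementation argument and that HTMLParser takes the resulting builder as tree. The sample markup and variable names are illustrative only. Wrapping the returned root in etree.ElementTree is an extra step, added in case an actual ElementTree object (rather than an Element) is what you want:

```python
import xml.etree.ElementTree as etree

import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative markup with the same unclosed <br> as in the question.
markup = (
    '<html><body><div class="cls">Note that this br<br>'
    'is missing a closing slash</div></body></html>'
)

# Ask html5lib to build its tree with the plain xml.etree.ElementTree module.
builder = html5lib.getTreeBuilder("etree", implementation=etree)
parser = html5lib.HTMLParser(tree=builder)

# parse() returns the root <html> element, not an ElementTree.
root = parser.parse(markup)

# Wrap the root to get an ElementTree object.
tree = etree.ElementTree(root)

# html5lib places HTML elements in the XHTML namespace by default,
# so the lookup goes through a prefix map.
div = tree.find(".//h:div", namespaces={"h": XHTML_NS})
```

Here the namespaces lookup goes through the ElementTree wrapper rather than the bare Element, which matches the caveat above about where that argument is supported.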