using html5lib with xml.etree.ElementTree


What I need is a way to use the html5lib parser to generate a real xml.etree.ElementTree. (lxml is not an option for portability reasons.)

ElementTree.parse can take a parser as an optional parameter:

xml.etree.ElementTree.parse(source, parser=None)

but it's not clear what such a parser would look like. Is there a class or object within html5lib I could use for the parser argument? Documentation for both libraries on this issue is thin.


Context:

I have a malformed XHTML file that can't be parsed with ElementTree.parse:

<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Title</title></head>
<body><div class="cls">Note that this br<br>is missing a closing slash</div></body>
</html>

So I used html5lib.parse instead with the default treebuilder="etree" parameter, which worked fine.

But html5lib apparently does not output an xml.etree.ElementTree object, just one with a near-identical API. There are two problems with this:

  • html5lib's find does not support the namespaces parameter, making XPath excessively verbose without a clumsy wrapper function.
  • The Eclipse debugger does not support drill-through of html5lib etrees.

So I cannot use either ElementTree or html5lib alone.

There are 2 answers

gsnedders (best answer)

Assuming xml.etree.ElementTree is imported as etree (as it commonly is):

What's returned is not an etree.ElementTree, but rather an etree.Element (this is the same as what etree.fromstring returns; only etree.parse returns an etree.ElementTree). It is genuinely part of the etree module — it's not something with a similar API. The problem you've run into applies to etree.fromstring as much as it does html5lib.
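
To make the Element versus ElementTree distinction concrete, here is a minimal standard-library sketch. It deliberately leaves html5lib out; the sample document and variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as etree

xml_text = "<root><child/></root>"  # illustrative input, not from the question

# fromstring() hands back the root Element directly.
element = etree.fromstring(xml_text)

# parse() reads from a file-like object and returns an ElementTree wrapper.
tree = etree.parse(io.StringIO(xml_text))

print(type(element).__name__)  # Element
print(type(tree).__name__)     # ElementTree
```

The root Element of a parsed tree is available through tree.getroot(), so the two forms are easy to convert between.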

The Python documentation for xml.etree.ElementTree doesn't mention the namespaces argument — it seems to be an undocumented feature of ElementTree objects (but not Element objects). As such, it's probably not something that should really be relied on! Your best bet is likely going to be to use a wrapper function.
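
One way such a wrapper could look, as a rough sketch: it expands prefix:tag steps into the {uri}tag form that ElementTree uses for namespaced tags, so it does not depend on the namespaces argument at all. The names XHTML_NS, expand and find_ns are hypothetical, and the regular expression is a simplification that would also rewrite prefix:name text inside predicates.

```python
import re
import xml.etree.ElementTree as etree

# Hypothetical prefix map; "x" is an arbitrary prefix chosen for this sketch.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

def expand(path, namespaces=XHTML_NS):
    """Rewrite each prefix:tag step in path as {uri}tag."""
    def repl(match):
        prefix, tag = match.group(1), match.group(2)
        return "{%s}%s" % (namespaces[prefix], tag)
    return re.sub(r"([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)", repl, path)

def find_ns(element, path, namespaces=XHTML_NS):
    """Call element.find() with the prefixes expanded."""
    return element.find(expand(path, namespaces))

# Illustrative document standing in for parser output.
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="cls">hi</div></body></html>'
)
div = find_ns(root, "x:body/x:div")
print(div.text)  # hi
```

An unknown prefix raises KeyError here, which is arguably preferable to silently matching nothing.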

Eclipse cannot drill into the trees because html5lib defaults to xml.etree.cElementTree when it exists. Per the module's documentation that module is meant to be identical, but it is implemented in C against CPython's API, which stops Eclipse's debugger from inspecting its objects. You can get a treebuilder that uses the non-accelerated version with the code below (note that from Python 3.3 both are the C implementation, and cElementTree merely survives as a deprecated alias):

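import xml.etree.ElementTree as etree
import html5lib

# Ask for an etree treebuilder bound to this module instead of html5lib's default.
tb = html5lib.getTreeBuilder("etree", implementation=etree)
p = html5lib.HTMLParser(tb)
# parse() returns the root Element, not an ElementTree.
tree = p.parse("<html>")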

reubano

You have to wrap the returned Element in an ElementTree:

>>> from xml.etree.ElementTree import ElementTree
>>> from html5lib import parse
>>>
>>> ElementTree(parse("<html>"))
<xml.etree.ElementTree.ElementTree at 0x...>
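
Once wrapped, the tree's find can be given a prefix map. The sketch below assumes a Python 3 version where ElementTree.find accepts a namespaces argument, and it uses etree.fromstring as a stand-in for the html5lib output so that it runs with the standard library alone. The prefix "x" and the sample markup are illustrative.

```python
import xml.etree.ElementTree as etree

# Hypothetical prefix map; any prefix works as long as the URI matches.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Stand-in for the Element that html5lib.parse() would return.
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="cls">text</div></body></html>'
)

# Wrap the root Element to get an ElementTree object.
tree = etree.ElementTree(root)

# Paths are resolved relative to the root element.
div = tree.find("x:body/x:div", namespaces=NS)
print(div.get("class"))  # cls
```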