I'm rather surprised that lxml.html leaves insignificant whitespace when parsing HTML by default. I'm also surprised that I can't find any obvious way to make it not do that.
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
>>> html = lxml.etree.HTML("<p> Hello World </p>", parser=parser)
>>> print lxml.etree.tostring(html)
<html><body><p> Hello World </p></body></html>
I expected the result to be something like:
>>> print lxml.etree.tostring(html)
<html><body><p>Hello World</p></body></html>
BeautifulSoup4 does the same thing with the html5lib parser:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p> Hello World </p>", "html5lib")
>>> soup.p
<p> Hello World </p>
After some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that happens at render time instead. So I understand that it's technically not the responsibility of any of these libraries to do this, but it seems useful enough that I'm surprised none of them offer it anyway.
Can somebody prove me wrong?
Edit:
I know how to remove whitespace using a regex; that was not my question. (I also know how to search SO for questions about regex.)
My question is about insignificant whitespace, where significance is defined by the standards for rendering HTML. I doubt that a one-line regex can correctly implement that standard. And let's not delve into the regex vs CFG debate again, please?
RegEx match open tags except XHTML self-contained tags
Edit 2:
In case it's not clear from the context, I am interested in HTML, not XHTML/XML. Whitespace does have some non-trivial rules of significance in HTML; however, those rules are implemented in the renderer, not the parser. I understand that, as my initial post shows. My question is whether anybody has implemented the whitespace logic of an HTML renderer in a library that operates at the DOM level rather than at the rendering level.
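To make the ask concrete, here is a rough sketch of the kind of DOM-level normalisation I mean. It is not taken from any existing library and it only approximates what a renderer does for `white-space: normal`: runs of whitespace collapse to one space, and spaces at the start or end of a block are dropped. The names `collapse_whitespace`, `BLOCK_TAGS` and `PRESERVE_TAGS` are made up for illustration, the tag lists are incomplete, and CSS that changes `display` or `white-space` is ignored entirely. It uses Python 3 and the standard library's ElementTree so that it runs anywhere; it relies only on `.tag`, `.text` and `.tail`, so it should also work on a tree parsed by lxml, though I have only reasoned that through on the small inputs shown.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, incomplete tag lists; a real renderer decides this from CSS.
BLOCK_TAGS = {
    "html", "head", "title", "body", "div", "p", "pre", "ul", "ol", "li",
    "dl", "dt", "dd", "table", "tr", "td", "th", "blockquote", "section",
    "article", "header", "footer", "nav", "form", "br", "hr",
    "h1", "h2", "h3", "h4", "h5", "h6",
}
PRESERVE_TAGS = {"pre", "textarea", "script", "style"}
_WS = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(root):
    """Collapse insignificant whitespace in place and return the root."""
    last = None         # (element, "text" | "tail") of the last non-empty slot
    at_boundary = True  # True while a leading space would not be rendered

    def fix(el, attr):
        nonlocal last, at_boundary
        text = getattr(el, attr)
        if not text:
            return
        text = _WS.sub(" ", text)
        if at_boundary:
            text = text.lstrip(" ")
        setattr(el, attr, text or None)
        if text:
            last = (el, attr)
            at_boundary = text.endswith(" ")

    def trim_last():
        # A block starts or ends here: drop the trailing space before it.
        nonlocal last, at_boundary
        if last is not None:
            el, attr = last
            setattr(el, attr, getattr(el, attr).rstrip(" ") or None)
        last = None
        at_boundary = True

    def visit(el):
        nonlocal last, at_boundary
        # Comments and processing instructions have a non-string tag.
        tag = el.tag.lower() if isinstance(el.tag, str) else None
        is_block = tag in BLOCK_TAGS
        if is_block:
            trim_last()
        if tag in PRESERVE_TAGS:
            # Leave the contents alone and treat them as inline content.
            last = None
            at_boundary = False
        elif tag is not None:
            fix(el, "text")
            for child in el:
                visit(child)
                fix(child, "tail")
        if is_block:
            trim_last()

    visit(root)
    return root


doc = ET.fromstring("<html><body><p> Hello World </p></body></html>")
print(ET.tostring(collapse_whitespace(doc), encoding="unicode"))
# <html><body><p>Hello World</p></body></html>
```

Note that a space inside an inline element is kept when it separates words, so `<p>Hello   <b> big </b>  World</p>` becomes `<p>Hello <b>big </b>World</p>` rather than losing the gap before "World". That is exactly the part a regex over the serialized markup cannot get right without knowing which tags are block-level.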
I came across this library.
Can be installed with pip:
It's used like:
Which returns:
I thought it would do what you were looking for, but as you can see, some insignificant spaces were kept.