How to remove insignificant whitespace in lxml.html?

I'm rather surprised that lxml.html leaves insignificant whitespace when parsing HTML by default. I'm also surprised that I can't find any obvious way to make it not do that.

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
>>> html = lxml.etree.HTML("<p>      Hello     World     </p>", parser=parser)
>>> print lxml.etree.tostring(html)
<html><body><p>      Hello     World     </p></body></html>

I expect the result would be something like:

>>> print lxml.etree.tostring(html)
<html><body><p>Hello World</p></body></html>

BeautifulSoup4 does the same thing with the html5lib parser:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>      Hello     World     </p>", "html5lib")
>>> soup.p
<p>      Hello     World     </p>

After doing some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that is done at render time instead. So I understand that it's technically not the responsibility of any of these libraries to replicate that behavior, but it seems useful enough that I'm surprised none of them offer it anyway.

Can somebody prove me wrong?

Edit:

I know how to remove whitespace using a regex — that was not my question. (I also know how to search SO for questions about regex.)

My question has to do with the insignificant whitespace, where significance is defined by the standards for rendering HTML. I doubt that a 1-liner regex can correctly implement this standard. And let's not even delve into the regex vs CFG debate again, please?

RegEx match open tags except XHTML self-contained tags

Edit 2:

In case it's not clear from the context, I am interested in HTML, not XHTML/XML. Whitespace does have some non-trivial rules of significance in HTML; however, those rules are implemented in the renderer, not the parser. I understand that, as evidenced in my initial post. My question is whether anybody has implemented the whitespace logic of an HTML renderer in a library that operates at the DOM level rather than at the rendering level.

There are 2 answers

4
Ivan Chaer On BEST ANSWER

I came across the htmlmin library.

It can be installed with pip:

pip install htmlmin

It's used like:

from htmlmin import minify
html = u"<html><body><p>      Hello     World     </p></body></html>"
minified_html = minify(html)
print minified_html

Which returns:

<html><body><p> Hello World </p></body></html>

I thought it would do what you were looking for, but as you can see, the insignificant leading and trailing spaces inside the paragraph were kept.

0
Wjars On

OK. You would like to detect runs of whitespace and remove the excess.

You can do it with a regular expression.

from re import sub
yourstring = "<p>      Hello     World     </p>"
sub(r"\s+", " ", yourstring)

It replaces each run of adjacent whitespace characters with a single space.

'<p> Hello World </p>'

was my result with this.

I suppose it's close enough to your expectations, and a single space is always better for readability than none.

With a slightly longer regular expression, you should be able to remove the whitespace adjacent to HTML tags as well.
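As a rough illustration of that last step, here is a minimal sketch that first collapses runs of whitespace and then drops the single space left next to a tag. The function name `collapse_whitespace` is made up for this example, and the approach is only a string-level approximation: it knows nothing about `<pre>`, `<textarea>`, or the rendering rules for inline elements.

```python
import re

def collapse_whitespace(markup):
    """Naive cleanup (hypothetical helper); not a real HTML whitespace model."""
    # Step 1: turn every run of whitespace into one space.
    collapsed = re.sub(r"\s+", " ", markup)
    # Step 2: remove a space directly after '>' or directly before '<'.
    return re.sub(r"(?<=>) | (?=<)", "", collapsed)

print(collapse_whitespace("<p>      Hello     World     </p>"))
# prints: <p>Hello World</p>
```

Note that the second substitution also removes the space in something like `<b>a</b> <i>b</i>`, which a browser would render, so it should be treated as an approximation rather than an implementation of the rendering rules the question asks about.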