I am trying to clean up an HTML table using lxml.html.clean.Cleaner(). I need to strip JavaScript attributes, but I would like to preserve inline CSS styles. I thought style=False was the default:
import lxml.html.clean
cleaner = lxml.html.clean.Cleaner()
However, when I call cleaner.clean_html(doc),
<span style="color:#008800;">67.51</span>
will become
<span>67.51</span>
In other words, the style attribute is not preserved. I tried adding:
cleaner.style = False
but it doesn't help.
Update: I am using Python 2.6.6 + lxml 3.2.4 on Dreamhost, and Python 2.7.5 + lxml 3.2.4 on my local MacBook. Same results. Another thing: there is a JavaScript-related attribute in my HTML:
<td style="cursor:pointer;">Ticker</td>
Could it be that lxml stripped this JavaScript-related style and then treated the other styles the same way? I hope not.
It works if you set cleaner.safe_attrs_only = False. The set of "safe" attributes (Cleaner.safe_attrs) is defined in the lxml.html.defs module (source code), and style is not included in the set.
But even better than cleaner.safe_attrs_only = False is to use
Cleaner(safe_attrs=lxml.html.defs.safe_attrs | set(['style']))
This will preserve style and at the same time protect against other unsafe attributes.
Demo code:
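A minimal sketch of the approach, assuming lxml's HTML cleaner is importable as lxml.html.clean (on recent lxml releases the cleaner lives in the separate lxml_html_clean package, which may need to be installed). The sample markup, including the onclick handler, is made up for illustration and is not taken from the question's page.

```python
import lxml.html.clean
import lxml.html.defs

# Hypothetical sample markup; the onclick handler is added for illustration.
html = (
    '<table><tr>'
    '<td style="cursor:pointer;" onclick="alert(1)">Ticker</td>'
    '<td><span style="color:#008800;">67.51</span></td>'
    '</tr></table>'
)

# Default cleaner: "style" is not in the safe attribute set, so it is stripped.
default_cleaned = lxml.html.clean.Cleaner().clean_html(html)

# Extend the safe set with "style"; attributes outside the set are still removed.
safe_attrs = lxml.html.defs.safe_attrs | set(['style'])
cleaner = lxml.html.clean.Cleaner(safe_attrs=safe_attrs)
cleaned = cleaner.clean_html(html)

print(default_cleaned)
print(cleaned)
```

The first printed string should have lost its style attributes, while the second should keep them and still drop the onclick handler.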
Output: