How to preserve inline CSS style with lxml.html.clean.Cleaner() in Python?

I am trying to clean up an HTML table using lxml.html.clean.Cleaner(). I need to strip JavaScript attributes but would like to preserve inline CSS styles. I thought style=False was the default setting:

import lxml.html.clean
cleaner = lxml.html.clean.Cleaner()

However, when I call cleaner.clean_html(doc),

<span style="color:#008800;">67.51</span>

will become

<span>67.51</span>

Basically, the style attribute is not preserved. I tried adding:

cleaner.style = False

but it doesn't help.

Update: I am using Python 2.6.6 + lxml 3.2.4 on Dreamhost, and Python 2.7.5 + lxml 3.2.4 on my local MacBook, with the same results. Another thing: there is a JavaScript-related attribute in my HTML:

<td style="cursor:pointer;">Ticker</td>

Could it be that lxml stripped this JavaScript-related style and then treated the other styles the same way? I hope not.

There is 1 answer

mzjn On BEST ANSWER

It works if you set cleaner.safe_attrs_only = False.

The set of "safe" attributes (Cleaner.safe_attrs) is defined in the lxml.html.defs module (source code) and style is not included in the set.

But even better than cleaner.safe_attrs_only = False is to use Cleaner(safe_attrs=lxml.html.defs.safe_attrs | set(['style'])). This will preserve style and at the same time protect from other unsafe attributes.
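To illustrate why extending the allow-list is safer than switching the check off, here is a minimal plain-Python sketch of allow-list filtering. It is not lxml's implementation: SAFE_ATTRS is a hypothetical three-item subset and filter_attrs is a made-up helper, used only to show what the set union does.

```python
# Illustrative only: a tiny stand-in for allow-list attribute filtering.
# SAFE_ATTRS is a hypothetical subset, not lxml's real safe_attrs set.
SAFE_ATTRS = frozenset(["class", "href", "title"])


def filter_attrs(attrs, allowed):
    """Return only the attributes whose names appear in the allow-list."""
    return {name: value for name, value in attrs.items() if name in allowed}


attrs = {"style": "color:#008800;", "onclick": "alert(1)", "class": "price"}

# Default allow-list: both style and onclick are dropped.
default_result = filter_attrs(attrs, SAFE_ATTRS)

# Extended allow-list: style survives, onclick is still dropped.
extended_result = filter_attrs(attrs, SAFE_ATTRS | {"style"})

print(default_result)
print(extended_result)
```

Disabling the check entirely would correspond to returning attrs unfiltered, which would let onclick through as well.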

Demo code:

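from lxml import html
from lxml.html import clean

# Sample input: a non-standard <marquee> wrapper around a styled <span>
s = '<marquee><span style="color: #008800;">67.51</span></marquee>'
doc = html.fromstring(s)

# Extend the default allow-list with "style" instead of disabling the check
cleaner = clean.Cleaner(safe_attrs=html.defs.safe_attrs | set(['style']))

# Under Python 3, tostring() returns bytes, so the output is shown as b'...'
print(html.tostring(cleaner.clean_html(doc)))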

Output:

<div><span style="color: #008800;">67.51</span></div>