I'm trying to 'defrontpagify' the html of a MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it.
However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet:
REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                     'dir','face','size','color','style','class','width','height','hspace',
                     'border','valign','align','background','bgcolor','text','link','vlink',
                     'alink','cellpadding','cellspacing']
# remove all attributes in REMOVE_ATTRIBUTES from all tags,
# but preserve the tag and its content.
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])
It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll(style=True)), it works.
Does anyone see the problem here?
PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.
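The line

for tag in soup.findAll(attribute=True):

does not find any tags, because attribute=True is a literal keyword argument: it searches for tags that have an attribute actually named "attribute" rather than using the value of the loop variable. There might be a way to use findAll, I'm not sure. However, this works (as of beautifulsoup 4.8.1):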
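A minimal sketch of one approach that works on bs4, not necessarily the exact code originally posted here: walk every node in the tree and rebuild each tag's attrs dict without the unwanted keys. The sample markup, the shortened attribute set, and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical subset of the question's list; a set gives fast lookups.
REMOVE_ATTRIBUTES = {"lang", "style", "class", "width", "bgcolor"}

# Hypothetical sample markup standing in for a FrontPage-generated page.
html = ('<body bgcolor="#fff"><p class="MsoNormal" style="x" id="keep">'
        'Hi <b lang="en">there</b></p></body>')
soup = BeautifulSoup(html, "html.parser")

for node in soup.descendants:
    # descendants also yields text nodes, which have no attributes; skip them.
    if isinstance(node, Tag):
        node.attrs = {key: value for key, value in node.attrs.items()
                      if key not in REMOVE_ATTRIBUTES}

result = str(soup)
print(result)
```

Only the attributes are dropped; the tags and their content stay in place, and attributes not in the set (id in this sample) are kept.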
This is previous code that may have worked with an older version of beautifulsoup:
Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.
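As for whether findAll can be made to work with the loop from the question: a sketch, assuming bs4, is to pass the attribute name through the attrs dictionary so that the loop variable's value is used. This also sidesteps names such as text, which find_all treats as a separate argument when given as a keyword. The sample markup and the shortened list are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical subset of the question's list.
REMOVE_ATTRIBUTES = ["lang", "style", "text", "width"]

# Hypothetical sample markup.
html = ('<body text="#000"><p style="x" id="keep">'
        'Hi <b lang="en" width="3">there</b></p></body>')
soup = BeautifulSoup(html, "html.parser")

for attribute in REMOVE_ATTRIBUTES:
    # attrs={name: True} matches any tag that has that attribute, whatever its value.
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

cleaned = str(soup)
print(cleaned)
```

This keeps the nested loops from the question, at the cost of one tree search per attribute name.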