How to remove links from HTML completely with Bleach?


Bleach strips non-whitelisted tags from HTML, but leaves child nodes, e.g.

>>> import bleach
>>> bleach.clean('<a href="">stays</a>', strip=True, tags=[])
'stays'

How can the entire element along with its children be removed?

There is 1 answer.

Answer by markwalker_:

Use lxml for this. Bleach is meant for sanitizing markup so that what you store is safe; it is not a general tool for editing the document tree.

lxml parses structured data such as HTML or XML into a tree, and lets you remove an element together with its children.

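Consider a simple HTML file, hello_world.html:

<html>
<body>
<p>Hello, World!</p>
</body>
</html>

Parse it, then drop the <p> element along with its children:

from lxml import html

# Parse the file and get the root <html> element
root = html.parse("hello_world.html").getroot()

print(html.tostring(root, encoding="unicode"))
# <html><body><p>Hello, World!</p></body></html>  (whitespace aside)

# Find the <p> element inside <body>
p = root.find("body/p")

# Remove the element, its children, and its text from the tree
p.drop_tree()

print(html.tostring(root, encoding="unicode"))
# <html><body></body></html>  (whitespace aside)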
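Applied to the original question, the same drop_tree() call can remove every link from an HTML string rather than from a file. The following is a minimal sketch, assuming the input is an HTML fragment and not a full document; remove_links is a hypothetical helper name used only for illustration and is not part of lxml or Bleach.

```python
from lxml import html


def remove_links(markup: str) -> str:
    """Return markup with every <a> element and its children removed.

    Hypothetical helper for illustration; assumes `markup` is a fragment.
    """
    # Wrap the fragment in a temporary <div> so plain text and multiple
    # top-level elements can be parsed.
    wrapper = html.fragment_fromstring(markup, create_parent="div")

    # Collect the links first, then drop each one with its subtree.
    for link in wrapper.xpath(".//a"):
        link.drop_tree()

    # Serialize the wrapper's contents without the temporary <div> itself.
    parts = [wrapper.text or ""]
    parts.extend(html.tostring(child, encoding="unicode") for child in wrapper)
    return "".join(parts)


if __name__ == "__main__":
    print(remove_links('<p>Hello <a href="#">gone</a> world</p>'))
```

Note that drop_tree() keeps the text that follows the removed element (its tail), so the words around a link survive while the link text does not. If the result still needs sanitizing before storage, it could then be passed through Bleach as a separate step.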

On a related note, if you want to look into more advanced parsing with lxml, one of my oldest questions on here was about getting Python to parse XML and generate Python code from it: "Writing a Python tool to convert XML to Python?"