What I am trying to do:
Remove suspicious comments from HTML mails with bs4. I have run into a problem with so-called conditional comments of the downlevel-revealed type.
import bs4
html = 'A<!--[if expression]>a<![endif]-->' \
'B<![if expression]>b<![endif]>'
soup = bs4.BeautifulSoup(html, 'html5lib')
# string= is the current name of the older text= argument
for comment in soup.find_all(string=lambda text: isinstance(text, bs4.Comment)):
    comment.extract()
Before extracting comments:
'A',
'[if expression]>a<![endif]',
'B',
'[if expression]',
'b',
'[endif]',
After extracting comments:
'A',
'B',
'b',
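To check which node type the parser assigned to each string, a small sketch like the following may help. It assumes the html5lib parser is installed, as in the snippet above; the variable name nodes is only illustrative.

```python
import bs4

html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')

# Pair every string node (plain text and comments alike) with its bs4 class name.
nodes = [(type(node).__name__, str(node)) for node in soup.find_all(string=True)]

for kind, text in nodes:
    print(f'{kind:16} {text!r}')
```

Each line of the printout shows the class name next to the string, so plain text and comments can be told apart at a glance.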
Problem:
The lowercase b should be removed as well. bs4 parses the first block as a single Comment object, but the second block as three objects: Comment ([if expression]), NavigableString (b) and Comment ([endif]). Extraction removes only the two comments; the NavigableString with content 'b' remains in the DOM.
Is there a solution to this?
After reading up on conditional comments, I understand why this happens.
downlevel-hidden
Downlevel-hidden blocks are written as an ordinary comment, <!-- ... -->. Modern browsers treat the whole block as a single comment, and so does BeautifulSoup, so extracting comments removes the block completely, including its content.
downlevel-revealed
Downlevel-revealed blocks are written as <!...>b<!...>. Modern browsers do not recognize the two markers as valid markup and keep them out of the rendered content, so only b remains visible. BeautifulSoup parses the two markers as separate comments, so extracting comments removes only the markers, not the content between them.
Conclusion
BeautifulSoup handles conditional comments the same way modern browsers do. That is perfectly fine for my purposes.
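For anyone who does need the content between downlevel-revealed markers removed as well, here is a minimal sketch of one possible approach. It is not a bs4 feature: the helper strip_conditional_blocks and the two regular expressions are hypothetical names made up for this example. It assumes the document was parsed with html5lib, so that both markers arrive as Comment nodes, and that the [if ...] and [endif] markers are siblings under the same parent.

```python
import re
import bs4

# Hypothetical patterns for the two downlevel-revealed markers.
IF_MARKER = re.compile(r'^\[if\s[^\]]*\]$', re.IGNORECASE)
ENDIF_MARKER = re.compile(r'^\[endif\]$', re.IGNORECASE)


def strip_conditional_blocks(soup):
    """Remove all comments, plus everything between [if ...] and [endif] markers."""
    comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))
    for comment in comments:
        if IF_MARKER.match(comment.strip()):
            # Collect the siblings that follow, up to the closing marker.
            between = []
            closed = False
            for sibling in comment.next_siblings:
                if (isinstance(sibling, bs4.Comment)
                        and ENDIF_MARKER.match(sibling.strip())):
                    closed = True
                    break
                between.append(sibling)
            # Only drop the content when a closing marker was actually found.
            if closed:
                for node in between:
                    node.extract()
        comment.extract()


html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')
strip_conditional_blocks(soup)
result = soup.get_text()
print(result)
```

If an opening marker has no closing [endif] among its following siblings, the sketch leaves the content in place and removes only the comment itself, which seemed the safer default for mail bodies with broken markup.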