How to keep all html elements with selector but drop all others?

Asked by Wuff on 13 September 2021 at 12:34 (66 views)

I would like to get an HTML string without certain elements. However, I only know up front which elements to keep, not which ones to drop.

Let's say I just want to keep all p and a tags inside the div with class="A".

Input:

<div class="A">
  <p>Text1</p>
  <img src="A.jpg">
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
</div>
<div class="B">
  ContentDiv2
</div>

Expected output:

<div class="A">
  <p>Text1</p>
  <p>Text2</p>
  <a href="url">link text</a>
</div>

If I knew the selectors of all the other elements, I could just use lxml's drop_tree(). The problem is that I don't know ['img', 'div.sub1', 'div.B'] up front.

Example with drop_tree():

import lxml.cssselect
import lxml.html

tree = lxml.html.fromstring(html_str)  # html_str holds the input HTML shown above

elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()

output = lxml.html.tostring(tree)

There are 2 answers

balderman On 13 September 2021 at 15:05

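Try the code below. The idea is to remove all children from the root and then re-append only the required sub-elements.

Note that no external library is required. However, xml.etree.ElementTree only parses well-formed XML, which is why the img tag is self-closed here.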

import xml.etree.ElementTree as ET

html = '''<div class="A">
  <p>Text1</p>
  <img src="A.jpg"/>
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
  ContentDiv2
</div>'''
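root = ET.fromstring(html)
# Collect the direct children to keep before emptying the root
p_lst = root.findall('./p')
a_lst = root.findall('./a')
# Detach every child (iterate over a copy, since root is modified in the loop)
children = list(root)
for c in children:
    root.remove(c)
# Re-append the kept elements; clearing tail drops the whitespace
# and any stray text that followed each element
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)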

Output (indented here for readability; ET.dump prints the element on a single line and without an XML declaration):

<div class="A">
   <p>Text1</p>
   <p>Text2</p>
   <a href="url">link text</a>
</div>
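
The same idea can also be written so that the kept elements stay in their original document order instead of being re-appended group by group. The following is only a minimal sketch: `keep_children` is a hypothetical helper name (not part of ElementTree), and it assumes the elements to keep are direct children of the root and that the input is well-formed XML.

```python
import xml.etree.ElementTree as ET


def keep_children(root, keep_tags):
    """Remove every direct child of root whose tag is not in keep_tags."""
    # Iterate over a copy, because root is modified inside the loop.
    for child in list(root):
        if child.tag not in keep_tags:
            # The child's tail text goes away together with the child.
            root.remove(child)
    return root


html = '''<div class="A">
  <p>Text1</p>
  <img src="A.jpg"/>
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
</div>'''

root = keep_children(ET.fromstring(html), {"p", "a"})
kept = [(child.tag, child.text) for child in root]
print(kept)
```

Because whole subtrees are removed, the nested p inside div.sub1 disappears together with its parent, which matches the expected output in the question.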

Jack Fleeting On 13 September 2021 at 15:03 (accepted answer)

I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:

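import lxml.html

tree = lxml.html.fromstring(html_str)  # html_str holds the input HTML from the question
target = tree.xpath('//div[@class="A"]')[0]
# Relative paths: only direct p and a children of the target are kept
to_keep = target.xpath('./p | ./a')
# remove() only accepts direct children, so iterate over a copy of them
for t in list(target):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())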

The output I get is your expected output.
