How to keep all HTML elements matching a selector but drop all others?

I would like to get an HTML string without certain elements. However, I only know up front which elements to keep, not which ones to drop.

Let's say I just want to keep all p and a tags inside the div with class="A".

Input:

<div class="A">
  <p>Text1</p>
  <img src="A.jpg">
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
</div>
<div class="B">
  ContentDiv2
</div>

Expected output:

<div class="A">
  <p>Text1</p>
  <p>Text2</p>
  <a href="url">link text</a>
</div>

If I knew the selectors of all the other elements, I could just use lxml's drop_tree(). The problem is that I don't know ['img', 'div.sub1', 'div.B'] up front.

Example with drop_tree():

import lxml.cssselect
import lxml.html

tree = lxml.html.fromstring(html_str)

elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()

output = lxml.html.tostring(tree)
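
One detail of drop_tree() matters when comparing it with other ways of removing elements: it keeps the text that follows the dropped element (its tail), whereas the plain remove() method discards that text together with the element. The following is a minimal sketch of the difference; the fragment and variable names are made up for illustration.

```python
import lxml.html

fragment = '<div><b>bold</b> tail text</div>'

# drop_tree(): the element goes away, its tail text is joined to the parent
a = lxml.html.fromstring(fragment)
a[0].drop_tree()
dropped = lxml.html.tostring(a).decode()

# remove(): the element and its tail text both go away
b = lxml.html.fromstring(fragment)
b.remove(b[0])
removed = lxml.html.tostring(b).decode()

print(dropped)
print(removed)
```

Which behavior is preferable depends on whether the loose text between elements is part of what should be kept.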

There are 2 answers

Jack Fleeting On BEST ANSWER

I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:

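import lxml.html

tree = lxml.html.fromstring(html_str)  # html_str is the input from the question
target = tree.xpath('//div[@class="A"]')[0]
to_keep = target.xpath('.//p | .//a')  # relative paths: search only inside target
for t in target.xpath('.//*'):
    if t not in to_keep:
        # remove via the actual parent, so nested elements do not raise ValueError
        t.getparent().remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())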

The output I get is your expected output.

balderman On

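Try the code below. The idea is to empty the root element and then re-append only the required child elements.

Note that no external library is required. However, xml.etree.ElementTree only parses well-formed XML, which is why the img tag is self-closed in this input.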

import xml.etree.ElementTree as ET

html = '''<div class="A">
  <p>Text1</p>
  <img src="A.jpg"/>
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
  ContentDiv2
</div>'''
root = ET.fromstring(html)
p_lst = root.findall('./p')
a_lst = root.findall('./a')
children = list(root)
for c in children:
    root.remove(c)
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)

Output (ET.dump() prints it on a single line and without an XML declaration; it is indented here for readability):

<div class="A">
   <p>Text1</p>
   <p>Text2</p>
   <a href="url">link text</a>
</div>
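
Because the code above re-appends all p elements first and all a elements afterwards, the original order changes whenever the two kinds are interleaved. A possible variation, sketched here with the standard library only, removes the unwanted children in place instead. The helper name keep_children and the reordered sample input are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def keep_children(root, keep_tags):
    """Drop every direct child of root whose tag is not in keep_tags."""
    # Iterate over a copy: removing from root while looping over it
    # directly would skip elements.
    for child in list(root):
        if child.tag not in keep_tags:
            root.remove(child)
    return root

xml_str = '''<div class="A">
  <p>Text1</p>
  <img src="A.jpg"/>
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <a href="first">first link</a>
  <p>Text2</p>
</div>'''

root = keep_children(ET.fromstring(xml_str), {'p', 'a'})
tags = [child.tag for child in root]
print(tags)  # ['p', 'a', 'p'] : document order is preserved
print(ET.tostring(root, encoding='unicode'))
```

Like the original, this only looks at direct children of the root, so the nested Subtext1 paragraph is dropped together with div.sub1.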