How to modify the HTML object of a requests-html response?


Let's say I get some html successfully using the following:

from requests_html import HTMLSession

session = HTMLSession()
url = "https://example.com"
html = session.get(url).html

Now I want to modify that html and then save it to a local file. How would I do that?

I want to update the href attribute, but this doesn't do it:

for a in html.find("a"):
    link = a.attrs['href']  # https://example.com/page1.html
    a.attrs['href'] = "page1.html"

with open("index.html", "wb") as f:
    f.write(html.raw_html)

Is there a way to do this with requests-html, or do I have to use lxml, bs4, or pyquery to edit the HTML?


There is 1 answer

Yuri R

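The HTML object in requests-html is geared toward reading a page rather than editing it, so it's better to parse the response with lxml, modify the elements there, and rebuild the HTML from the modified tree.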

from requests_html import HTMLSession
from lxml import html

session = HTMLSession()
url = "https://example.com"
r = session.get(url)

# Parse the HTML
doc = html.fromstring(r.content)

# Modify the href attribute of each <a> tag
for a in doc.xpath('//a'):
    link = a.get('href', '')  # Get current href, default to empty string if not present
    # Modifying the href attribute (update this as per your requirement)
    new_link = link.replace('https://example.com/', '')
    a.set('href', new_link)

# Reconstruct the HTML from the modified elements
modified_html = html.tostring(doc, encoding='unicode')

# Saving the modified HTML to a local file
with open("index.html", "w", encoding="utf-8") as f:
    f.write(modified_html)
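
The link.replace call above strips the prefix wherever it appears in the string. If you only want to rewrite links that really point at the same site, a small helper built on urllib.parse can make that explicit. This is only a sketch: the name to_local_href and the choice of "index.html" for the site root are my own assumptions, not part of requests-html or lxml.

```python
from urllib.parse import urlsplit


def to_local_href(href: str, base_url: str) -> str:
    """Return a site-relative path for links on base_url's host.

    Hypothetical helper: links that are already relative, that point at
    another host, or that use another scheme are returned unchanged.
    """
    base = urlsplit(base_url)
    target = urlsplit(href)

    # Only rewrite absolute http(s) links on the same host as base_url.
    if target.scheme not in ("http", "https") or target.netloc != base.netloc:
        return href

    # Assumption: the site root maps to a local "index.html".
    path = target.path.lstrip("/") or "index.html"
    if target.query:
        path += "?" + target.query
    if target.fragment:
        path += "#" + target.fragment
    return path
```

Inside the loop you would then write a.set('href', to_local_href(link, url)) instead of calling link.replace.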

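In requests-html, the .find() method returns a list of Element objects that represent the HTML elements as they were at the time of parsing. Changing the attributes of these objects does not change the underlying HTML source text: html.raw_html still holds the bytes originally received from the server, so the file written in the question contains the unmodified page. Editing the lxml tree and serializing it with tostring() is what writes the changes out.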
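
To see the difference in isolation, here is a small sketch that needs no network access. It uses the standard library's xml.etree.ElementTree as a stand-in, since its get/set/items methods mirror the lxml ones used above. It assumes, as the behaviour in the question suggests, that what you edit through requests-html is a copy of the attributes and not the tree itself.

```python
import xml.etree.ElementTree as ET

# Stand-in for the parsed page; well-formed so ElementTree can parse it.
root = ET.fromstring(
    '<body><a href="https://example.com/page1.html">Page 1</a></body>'
)
a = root.find("a")

# Editing a copied dict of attributes leaves the tree untouched.
attrs_copy = dict(a.items())
attrs_copy["href"] = "page1.html"
unchanged = ET.tostring(root, encoding="unicode")

# Setting the attribute on the element itself changes what gets serialized.
a.set("href", "page1.html")
changed = ET.tostring(root, encoding="unicode")

print(unchanged)  # still contains the absolute URL
print(changed)    # now contains href="page1.html"
```

The same pattern applies to the lxml version: call a.set(...) on the tree's own elements, then serialize the tree, and do not expect edits to a separate copy to show up in the output.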