Dynamic Web Scraping with Helium

Question

971 views Asked by Minura Punchihewa At 26 November 2020 at 19:35

There is a web page that contains links to multiple articles, and I want to visit each of these articles and extract the text they contain. For this purpose, I wrote a script using the Helium Python package, but I keep running into the same error.

Given below is the script that I have used. I am basically trying to extract all the paragraph tags and create a Word document out of them. It works fine when I test it on a single article, but running it in this loop produces the error shown below.

from helium import *
import time
from docx import Document
from docx.shared import Inches

document = Document()

start_chrome('some url', headless = True)

time.sleep(5)
article_list = find_all(S('a'))

for article in article_list:
    url = article.web_element.get_attribute('href')
    if url.startswith('some substring'):
        go_to(url)
        time.sleep(5)
        paragraph_list = find_all(S('p'))
        for paragraph in paragraph_list:
            document.add_paragraph(paragraph.web_element.text)

This is the error that I keep getting:

StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-10-7a524350ae24> in <module>()
      1 for article in article_list:
----> 2     url = article.web_element.get_attribute('href')
      3     print(url)
      4     if url.startswith('some url'):
      5         go_to(url)

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: headless chrome=86.0.4240.198)
  (Driver info: chromedriver=2.38.552522 (437e6fbedfa8762dec75e2c5b3ddb86763dc9dcb),platform=Windows NT 10.0.19041 x86_64)

I am quite new to web scraping, so I don't know if there is something simple that I am missing. Any help here would be much appreciated.

There is 1 answer

**Minura Punchihewa** · Accepted Answer · 2020-11-27T17:28:50+00:00

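I was able to solve this issue. The problem was that the link elements in article_list belong to the first page. Calling go_to() navigates away from that page, so on the next iteration the remaining elements are no longer attached to the document, which is exactly what the StaleElementReferenceException reports. A better way to go about this is to read all of the URLs into a list of plain strings before navigating anywhere, as opposed to reading each href from the elements (articles) while iterating through them. The code for this is as follows: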

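from helium import *
import time
from docx import Document

document = Document()

start_chrome('some url', headless=True)

# Crude wait for the page to finish rendering.
time.sleep(5)
article_list = find_all(S('a'))

# Read every href into a plain string while the elements are still attached.
# Anchors without an href attribute return None, so drop those.
href_list = [article.web_element.get_attribute('href') for article in article_list]
href_list = [href for href in href_list if href]

for href in href_list:
    if href.startswith('some substring'):
        go_to(href)
        time.sleep(5)
        # These elements are looked up fresh on each article page.
        paragraph_list = find_all(S('p'))
        for paragraph in paragraph_list:
            document.add_paragraph(paragraph.web_element.text)

document.save('Extract.docx')
kill_browser()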
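
The filtering step can also be pulled out into a small helper that works on plain strings, which makes it easy to check without a browser. The following is only a sketch: the function name select_article_urls, the example.com URLs, and the de-duplication step are assumptions added for illustration and are not part of the original script.

```python
from typing import Iterable, List, Optional


def select_article_urls(hrefs: Iterable[Optional[str]], prefix: str) -> List[str]:
    """Return the unique hrefs that start with `prefix`, in first-seen order.

    Hypothetical helper: it only handles strings, so it never touches
    page elements that could go stale after navigation.
    """
    seen = set()
    selected = []
    for href in hrefs:
        # Anchors without an href attribute yield None; skip them.
        if not href or not href.startswith(prefix):
            continue
        # Skip links that appear more than once on the page (assumed desirable).
        if href in seen:
            continue
        seen.add(href)
        selected.append(href)
    return selected


# Made-up values standing in for what get_attribute('href') might return.
sample_hrefs = [
    "https://example.com/articles/1",
    None,
    "https://example.com/about",
    "https://example.com/articles/1",
    "https://example.com/articles/2",
]
article_urls = select_article_urls(sample_hrefs, "https://example.com/articles/")
print(article_urls)
```

In the script above, href_list could be passed through such a helper before the loop, so that each article is visited only once and links without an href are ignored.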
