Extract data from STATIC HTML FILE using python 3.5

Question

8k views Asked by user73324 At 03 January 2025 at 15:05

I have a static HTML page saved on my local machine. I tried reading it with a plain file open and with BeautifulSoup. The plain file open fails with a Unicode error before the whole file is read, and my BeautifulSoup code only works for live websites.

#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)


#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

After reading the HTML, I want to extract data from ul and li elements that don't have any attributes. Any recommendations are welcome.

There are 2 answers

宏杰李 On 03 January 2017 at 06:22

This question is about how to construct a BeautifulSoup Object.

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
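from bs4 import BeautifulSoup

# From an open filehandle (closed automatically by the with block)
with open("index.html", "rb") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# From a string
soup = BeautifulSoup("<html>data</html>", "html.parser")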

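Just pass a file object to BeautifulSoup; you do not need to specify the encoding yourself, because BS detects it. This detection only happens when BS receives bytes, so open the file in binary mode ('rb'). A file opened in text mode is decoded by Python first and can still raise a Unicode error.

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.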
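A minimal sketch of the filehandle approach, assuming the file is opened in binary mode so that Beautiful Soup, not Python's text layer, does the decoding. The sample markup and the temporary file are made up for illustration; they stand in for the saved page from the question.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page, stored on disk as UTF-8 bytes.
raw = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><ul><li>Caf\u00e9 &amp; bar</li></ul></body></html>"
).encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: Beautiful Soup receives bytes and works out the encoding.
    with open(path, "rb") as f:
        soup = BeautifulSoup(f, "html.parser")
finally:
    os.remove(path)

# The text is now a Unicode str, and the &amp; entity has become "&".
item_text = soup.li.get_text()
print(item_text)
```

The built-in 'html.parser' is used here only to avoid an extra dependency; 'lxml' should work the same way if it is installed.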

If you have trouble extracting the data, you should post the relevant HTML.

Extract:

import bs4

html = '''<ul class="indent">
<li><i>dependency-check version</i>: 1.4.3</li>
<li><i>Report Generated On</i>: Dec 30, 2016 at 13:33:27 UTC</li>
<li><i>Dependencies Scanned</i>:&nbsp;0 (0 unique)</li>
<li><i>Vulnerable Dependencies</i>:&nbsp;0</li>
<li><i>Vulnerabilities Found</i>:&nbsp;0</li>
<li><i>Vulnerabilities Suppressed</i>:&nbsp;0</li>
<li class="scaninfo">...</li>'''

soup = bs4.BeautifulSoup(html, 'lxml')
for i in soup.find_all('li', class_=False):
    print(i.text)

out:

dependency-check version: 1.4.3
Report Generated On: Dec 30, 2016 at 13:33:27 UTC
Dependencies Scanned: 0 (0 unique)
Vulnerable Dependencies: 0
Vulnerabilities Found: 0
Vulnerabilities Suppressed: 0
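Note that class_=False only excludes tags that carry a class attribute; an li with, say, an id would still match. If the goal is strictly "no attributes at all", as the question states, one option is to pass a function to find_all. The following is a sketch under that assumption; the HTML and the helper name bare are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <ul> with a class, one without any attributes.
html = """
<ul class="menu"><li>skip me</li></ul>
<ul>
  <li>first</li>
  <li id="x">second</li>
  <li>third</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def bare(name):
    """Return a filter matching <name> tags that have no attributes."""
    return lambda tag: tag.name == name and not tag.attrs


# Keep only attribute-less <li> tags inside attribute-less <ul> tags.
items = [
    li.get_text(strip=True)
    for ul in soup.find_all(bare("ul"))
    for li in ul.find_all(bare("li"))
]
print(items)
```

Here the first list is skipped because its ul has a class, and "second" is skipped because its li has an id.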

yumere On 03 January 2017 at 05:16 (Accepted Answer)

I'm not sure exactly what you mean, but I understand that you want to read the entire HTML file from local storage and parse parts of the DOM with bs4. Is that right?

If so, I suggest this code:

from bs4 import BeautifulSoup

# Read the whole file, decoding it explicitly as UTF-8
with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()

soup = BeautifulSoup(data, 'html.parser')
# universities = soup.find_all('a', class_='institution')
# for university in universities:
#     print(university['href'] + "," + university.string)

# Print the text of every attribute-less <li> inside an attribute-less <ul>
for ul in soup.select("ul"):
    if not ul.attrs:
        for li in ul.select("li"):
            if not li.attrs:
                print(li.get_text().strip())
