Extract data from a static HTML file using Python 3.5


I have a static HTML page saved on my local machine. I tried reading it with a plain file open and with BeautifulSoup. The plain file open does not read the entire HTML file because of a Unicode error, and my BeautifulSoup code only works for live websites.

#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)


#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

After reading the HTML, I want to extract data from ul and li elements that don't have any attributes. Any recommendations are welcome.


There are 2 answers

yumere On BEST ANSWER

I'm not sure exactly what you mean, but I understand that you want to read the entire HTML file from local storage and parse parts of the DOM with bs4, right?

Here is some code I'd suggest:

from bs4 import BeautifulSoup

with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'html.parser')
    # universities = soup.find_all('a', class_='institution')
    # for university in universities:
    #     print(university['href'] + "," + university.string)
    ul_list = soup.select("ul")
    for ul in ul_list:
        if not ul.attrs:
            for li in ul.select("li"):
                if not li.attrs:
                    print(li.get_text().strip())
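To try the same filtering without the saved page, here is a minimal self-contained sketch. The SAMPLE_HTML markup and the plain_list_items helper are hypothetical stand-ins, not part of the original question; only the attribute check mirrors the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the saved page.
SAMPLE_HTML = """
<ul class="nav"><li>Home</li></ul>
<ul>
  <li>First item</li>
  <li class="note">Skipped: has a class</li>
  <li>Second item</li>
</ul>
"""


def plain_list_items(html):
    """Return the text of attribute-free <li> tags inside attribute-free <ul> tags."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for ul in soup.find_all("ul"):
        if ul.attrs:  # skip lists that carry any attribute (class, id, ...)
            continue
        for li in ul.find_all("li"):
            if not li.attrs:  # keep only items without attributes
                items.append(li.get_text().strip())
    return items


print(plain_list_items(SAMPLE_HTML))
```

With this sample, only the two attribute-free items of the second list are returned; the navigation list and the item with a class are skipped.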
宏杰李 On

This question is about how to construct a BeautifulSoup object.

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

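from bs4 import BeautifulSoup

# Pass an open file handle (binary mode lets BeautifulSoup detect the encoding)
with open("index.html", "rb") as f:
    soup = BeautifulSoup(f, "html.parser")

# Or pass a string
soup = BeautifulSoup("<html>data</html>", "html.parser")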

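Just pass a file object to BeautifulSoup. If you open the file in binary mode, you do not need to specify the encoding yourself; BeautifulSoup will detect it. A file opened in text mode is decoded by Python first, using the platform's default encoding unless you pass one, which is where a Unicode error can come from.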
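As a rough illustration of passing raw bytes, the sketch below writes a small UTF-8 page to a temporary file and parses it from a binary file handle. The page content and the temporary file are assumptions made up for the example; they stand in for the saved HTML file.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content; the <meta> tag declares the encoding.
raw = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>caf\u00e9</p></body></html>"
).encode("utf-8")

# Write the bytes to a temporary file that stands in for the saved page.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # "rb" hands BeautifulSoup raw bytes, so it does the decoding itself.
    with open(path, "rb") as f:
        soup = BeautifulSoup(f, "html.parser")
    text = soup.p.get_text()
finally:
    os.remove(path)

print(text)
```

No encoding argument is passed to open() here, so Python never attempts to decode the file with the platform default.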

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
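A small sketch of that entity conversion, using a made-up snippet rather than the asker's page. Note that &nbsp; becomes the non-breaking space character U+00A0, not an ordinary space, so the last step (an optional extra, not something BeautifulSoup does for you) replaces it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet containing three HTML entities.
soup = BeautifulSoup("<p>caf&eacute; &amp; tea&nbsp;time</p>", "html.parser")

# Entities are already converted to Unicode characters in the parsed text.
text = soup.p.get_text()
print(repr(text))

# Optional: turn non-breaking spaces into ordinary spaces for comparisons.
normalized = text.replace("\xa0", " ")
print(normalized)
```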

If you have trouble extracting the data, you should post the HTML code.

Extract:

import bs4

html = '''<ul class="indent"> <li><i>dependency-check version</i>: 1.4.3</li> <li><i>Report Generated On</i>: Dec 30, 2016 at 13:33:27 UTC</li> <li><i>Dependencies Scanned</i>:&nbsp;0 (0 unique)</li> <li><i>Vulnerable Dependencies</i>:&nbsp;0</li> <li><i>Vulnerabilities Found</i>:&nbsp;0</li> <li><i>Vulnerabilities Suppressed</i>:&nbsp;0</li> <li class="scaninfo">...</li>'''

soup = bs4.BeautifulSoup(html, 'lxml')
for i in soup.find_all('li', class_=False):
    print(i.text)

Output:

dependency-check version: 1.4.3
Report Generated On: Dec 30, 2016 at 13:33:27 UTC
Dependencies Scanned: 0 (0 unique)
Vulnerable Dependencies: 0
Vulnerabilities Found: 0
Vulnerabilities Suppressed: 0