I have a static HTML page saved on my local machine. I tried reading it with a plain file open and with BeautifulSoup. The plain file open doesn't read the entire HTML file because of a Unicode error, and my BeautifulSoup code works for live websites but not for the local file.
#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)
#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
After reading the HTML, I want to extract data from ul and li elements that don't have any attributes. Any recommendations are welcome.
I don't know exactly what you mean. As I understand it, you want to read the entire HTML file from local storage and parse part of the DOM with bs4, right?
I suggest some code here:
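A minimal sketch of that reading, not a definitive solution. It assumes the page is opened directly from disk (urllib is only needed for URLs), that the file is read in binary mode so bs4 can work out the encoding itself, and that the ul and li elements can be matched by tag name alone since they carry no attributes. The helper names load_soup and list_items are made up for illustration.

```python
from bs4 import BeautifulSoup


def load_soup(path):
    """Parse a local HTML file from disk."""
    # Binary mode: bs4 detects the encoding, which sidesteps the
    # UnicodeDecodeError a text-mode read with the wrong codec can raise.
    with open(path, "rb") as f:
        return BeautifulSoup(f, "html.parser")


def list_items(soup):
    """Return the text of every <li> that sits inside a <ul>."""
    # No class or id is needed; the CSS selector matches by tag name only.
    return [li.get_text(" ", strip=True) for li in soup.select("ul li")]


# Small inline sample standing in for the saved page.
sample = "<ul><li>Alpha</li><li>Beta</li></ul>"
print(list_items(BeautifulSoup(sample, "html.parser")))
```

For the saved page, the call would be list_items(load_soup("Stack Overflow.html")).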