I am trying to scrape a Wikipedia infobox into a dictionary, where the first column of the infobox is the key and the second column is the value. Rows that do not have exactly 2 columns should be ignored. I am having trouble understanding how to get the value associated with each key. The page I am trying to scrape is https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347 and I want the data from the first infobox.
The results should look like: {"Name": "RMS Titanic", "Owner": "White Star Line", "Operator": "White Star Line", "Port of registry": "Liverpool, UK", "Route": "Southampton to New York City".....}
Here's what I've tried:
import requests
from bs4 import BeautifulSoup

def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)
    table = bs.find('table', {'class': 'infobox'})
    result = {}
    row_count = 0
    if table is None:
        pass
    else:
        for tr in table.find_all('tr'):
            if tr.find('th'):
                pass
            else:
                row_count += 1
                if row_count > 1:
                    if tr is not None:
                        result[tr.find('td').text.strip()] = tr.find('td').text
    return result

print(get_infobox("https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"))
Any help would be greatly appreciated!
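In the attempt above, both the key and the value come from `tr.find('td')`, which always returns the first cell of the row, so the second column is never read. A minimal sketch of one way to pair the two cells is shown here. It assumes each key/value row holds exactly two direct `th`/`td` children; the function name `parse_infobox` and the whitespace normalization are illustrative choices, not part of the original code.

```python
from bs4 import BeautifulSoup


def parse_infobox(html):
    """Map first-cell text to second-cell text for two-cell rows of the first infobox."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    result = {}
    if table is None:
        return result
    for tr in table.find_all("tr"):
        # Look at direct children only, so cells of nested tables are not counted.
        cells = tr.find_all(["th", "td"], recursive=False)
        if len(cells) != 2:
            continue  # skip headers, images and other rows without 2 columns
        # Collapse runs of whitespace (including newlines) into single spaces.
        key = " ".join(cells[0].get_text().split())
        value = " ".join(cells[1].get_text().split())
        result[key] = value
    return result
```

To try it on the page from the question, pass it the downloaded HTML, for example `parse_infobox(requests.get(url).text)`.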
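If you do not need or want to use a scraper, you could use the MediaWiki API instead:
https://www.mediawiki.org/wiki/API:Main_page/de
The English Wikipedia endpoint is https://en.wikipedia.org/w/api.php
For example, this request returns the wikitext of the page's lead section (section 0, which contains the infobox) as JSON: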
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Titanic&rvsection=0
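As a rough sketch, the same request can be made from Python with only the standard library. The helper names and the `User-Agent` string are made up for illustration, and `extract_wikitext` assumes the legacy JSON layout, where pages are keyed by page ID and the content sits under the `"*"` key. Note that this query returns the current revision rather than the specific `oldid` from the question, and that the result is raw wikitext, so the infobox template still has to be parsed.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the query URL for the wikitext of section 0 of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
        "rvsection": "0",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(data):
    """Pull the wikitext out of a decoded response (assumed legacy JSON layout)."""
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))  # a single title was requested
    return page["revisions"][0]["*"]


def fetch_lead_wikitext(title):
    """Download and return the lead-section wikitext for a title."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "infobox-example/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as response:
        return extract_wikitext(json.load(response))
```

Calling `fetch_lead_wikitext("Titanic")` should then return the wikitext that starts with the infobox template.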