How do I Web Scrape a Wikipedia Infobox Table?


I am trying to scrape a Wikipedia infobox and put the data into a dictionary, where the first column of the infobox is the key and the second column is the value. Rows that do not have exactly 2 columns should be ignored. I am having trouble understanding how to get the value associated with each key. The page I am trying to scrape is https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347, and I want to pull the information from the first infobox.

The results should look like: {"Name": "RMS Titanic", "Owner": "White Star Line", "Operator": "White Star Line", "Port of registry": "Liverpool, UK", "Route": "Southampton to New York City".....}

Here's what I've tried:

    import requests
    from bs4 import BeautifulSoup

    def get_infobox(url):
       response = requests.get(url)
       bs = BeautifulSoup(response.text)

       table = bs.find('table', {'class' :'infobox'})
       result = {}
       row_count = 0
       if table is None:
         pass
       else:
         for tr in table.find_all('tr'):
             if tr.find('th'):
                 pass
             else:
                 row_count += 1
         if row_count > 1:
             if tr is not None:
               result[tr.find('td').text.strip()] = tr.find('td').text
         return result

    print(get_infobox("https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"))

Any help would be greatly appreciated!

There is 1 answer

Nico Bleiler On
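
In the code from the question, the dictionary assignment sits outside the `for` loop, so it only ever sees the last `tr`, and it uses the same `td` for both the key and the value. One possible way to fix this is to collect the cells of each row inside the loop and keep only rows with exactly two of them. The following is a minimal sketch, not tested against the live page: the helper name `parse_infobox` and the `User-Agent` string are my own placeholders, and keys come out exactly as they appear in the page (including any trailing punctuation or footnote markers).

```python
from bs4 import BeautifulSoup


def parse_infobox(html):
    """Return {first-column text: second-column text} for the first infobox."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" also matches tables with several classes, e.g. "infobox vcard".
    table = soup.find("table", class_="infobox")
    result = {}
    if table is None:
        return result
    for tr in table.find_all("tr"):
        # Look only at the row's own cells, not cells of tables nested inside it.
        cells = tr.find_all(["th", "td"], recursive=False)
        if len(cells) != 2:
            continue  # skip image rows, section headings, and similar
        key = cells[0].get_text(" ", strip=True)
        value = cells[1].get_text(" ", strip=True)
        result[key] = value
    return result


def get_infobox(url):
    # Imported here so parse_infobox can be used without the network dependency.
    import requests

    # Placeholder User-Agent (an assumption); replace it with something descriptive.
    headers = {"User-Agent": "infobox-example/0.1"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return parse_infobox(response.text)
```

Calling `get_infobox(...)` with the URL from the question should then return a dictionary built from the two-cell rows of the first infobox. Rows of any table nested inside the infobox are also visited by `find_all("tr")`, so if that causes unwanted entries, those rows would need to be filtered out as well.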