Scraping with BeautifulSoup: want to scrape entire column including header and title rows


I'm trying to get the data in the columns whose codes start with "SVENY" followed by two digits (e.g. SVENY01, SVENY02) from the page requested in the code below, using Python.

With the code below I can get the first row of data for all the columns I want. However, is there a way to include the column headers and row titles as well?

I already collect the headers, but is there a way to include them in the output? Also, how can I include all the rows?

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)
Answer by double_j (best answer)

How's this?

I used th.getText() to store each column's name alongside its index in the desired_columns list, and added row_name = row.findNext('th').getText() to get each row's title.

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
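# enumerate() gives each header's position directly; headers.index(th) would
# return the first matching tag if two headers compared equal.
for index, th in enumerate(headers):
    # getText() always returns a string, whereas th.string can be None
    if 'SVENY' in th.getText():
        desired_columns.append([index, th.getText()])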

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
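
If you would rather have the header row, the row titles, and every data row in one structure instead of printing cell by cell, here is a minimal sketch. It runs on a small made-up HTML string rather than the real page, so SAMPLE_HTML, its values, and the helper name extract_columns are all assumptions for illustration. I have not checked that the real table is laid out the same way. It counts th and td cells together in each row so that a header's position lines up with the data cells beneath it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page; the values are made up.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>BETA0</th><th>SVENY01</th><th>SVENY02</th></tr>
  <tr><th>2015-01-02</th><td>3.9</td><td>0.25</td><td>0.68</td></tr>
  <tr><th>2015-01-05</th><td>3.8</td><td>0.24</td><td>0.66</td></tr>
</table>
"""


def extract_columns(table, prefix):
    """Return a list of rows: the header row first, then one list per data row.

    Each row starts with its title (the first cell) followed by the cells
    of every column whose header text starts with `prefix`.
    """
    rows = table.find_all('tr')
    header_cells = rows[0].find_all(['th', 'td'])
    # Positions of the wanted columns, counting every cell in the header row.
    wanted = [i for i, cell in enumerate(header_cells)
              if cell.get_text(strip=True).startswith(prefix)]

    result = [[header_cells[0].get_text(strip=True)] +
              [header_cells[i].get_text(strip=True) for i in wanted]]
    for row in rows[1:]:
        cells = row.find_all(['th', 'td'])
        if len(cells) < len(header_cells):
            continue  # skip short or spacer rows
        result.append([cells[0].get_text(strip=True)] +
                      [cells[i].get_text(strip=True) for i in wanted])
    return result


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table_data = extract_columns(soup.find('table'), 'SVENY')
for line in table_data:
    print(', '.join(line))
```

To try it on the real page, pass soup.findAll('table')[2] from the code above instead of the sample table. The first list in table_data is then the header row, and each later list begins with the row title.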