I have a table in which each row has 6 columns. I have read the column values into a DataFrame, but columns 1 and 6 also contain links that I want to add. I am new to Python and need help.

I have already tried creating a second DataFrame and storing the links in its first column, but the two DataFrames end up with different numbers of rows.

import urllib3
from bs4 import BeautifulSoup
import pandas as pd
import time

COLUMNS = ['Legal Name', 'Status', 'Size', 'Suburb or Town', 'State', 'ABN']
COLUMNS2 = ['Link1']

urls = []
for i in range(3):
    quotepage = "https://www.acnc.gov.au/charity?items_per_page=60&"
    quotepage = quotepage + "facet__select__field_beneficiaries=0&"
    quotepage = quotepage + "facet__select__field_countries=0&"
    quotepage = quotepage + "facet__select__acnc_search_api_sub_history=0&"
    quotepage = quotepage + "facet__select__field_status=307&"
    quotepage = quotepage + "page=" + str(i) + "#search"

    #print (quotepage)
    urls.append(quotepage)

i=0

dataframes = []
dataframes2 = []

cy_data = []
cy_data2 = []
for url in urls:
    i=i+1
    print(i)
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html5lib")
    pagetable = soup.find('table')
    rows = soup.find("table").find_all('tr') 

    time.sleep(.5)
    for row in rows:
        cells = row.find_all("td") 
        cells = cells[0:6] # Select the correct columns
        cy_data.append([cell.text.strip() for cell in cells])

    links = pagetable.find_all("a")
    for link in links:
        if len(link["href"]) == 41:  # href for charity
            cy_data2.append(link["href"])

dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
dataframes2.append(pd.DataFrame(cy_data2, columns=COLUMNS2).drop(0, axis=0))
#data = pd.concat([dataframes, dataframes2], axis=1)
data = pd.concat(dataframes)
data2 = pd.concat(dataframes2)

All I want is to add the links to the DataFrame.

1 Answer

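Don't drop the zero index from the DataFrames; in the links DataFrame that row is a real link, so dropping it throws the row counts off. Build them like so: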

dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS))
dataframes2.append(pd.DataFrame(cy_data2, columns=COLUMNS2))

And change the code that finds the table rows to the following, so that only body rows are read and the header row is skipped:

rows = soup.find("table").find("tbody").find_all('tr')

Result:

DataFrame 1  [180 rows x 6 columns]
DataFrame 2  [180 rows x 1 columns]
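
Since the goal is to have the links in the same DataFrame, another option is to read the href values row by row, so each link stays next to its own row and no second DataFrame is needed. Below is a minimal sketch of that idea, not a tested scraper for the real site. The sample HTML, the helper names first_href and parse_rows, and the column names Link1 and Link6 are all made up for illustration, and the real page markup may differ. It uses html.parser only so that no extra parser has to be installed; with that parser the tbody has to be present in the markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for one page of results; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Legal Name</th><th>Status</th><th>Size</th>
        <th>Suburb or Town</th><th>State</th><th>ABN</th></tr>
  </thead>
  <tbody>
    <tr><td><a href="/charity/aaa">Alpha Fund</a></td><td>Registered</td><td>Small</td>
        <td>Perth</td><td>WA</td><td><a href="/abn/111">111</a></td></tr>
    <tr><td><a href="/charity/bbb">Beta Trust</a></td><td>Registered</td><td>Large</td>
        <td>Hobart</td><td>TAS</td><td><a href="/abn/222">222</a></td></tr>
  </tbody>
</table>
"""

COLUMNS = ['Legal Name', 'Status', 'Size', 'Suburb or Town', 'State', 'ABN']
LINK_COLUMNS = ['Link1', 'Link6']  # assumed names for the two link columns


def first_href(cell):
    """Return the href of the first <a> in a cell, or None if there is none."""
    anchor = cell.find("a", href=True)
    return anchor["href"] if anchor else None


def parse_rows(html):
    """Build one record per body row: six text values plus the two hrefs."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Only rows inside <tbody> are read, so the header row is never included.
    for row in soup.find("table").find("tbody").find_all("tr"):
        cells = row.find_all("td")[:6]
        if len(cells) < 6:
            continue  # skip rows that do not have all six columns
        texts = [cell.text.strip() for cell in cells]
        records.append(texts + [first_href(cells[0]), first_href(cells[5])])
    return records


data = pd.DataFrame(parse_rows(SAMPLE_HTML), columns=COLUMNS + LINK_COLUMNS)
print(data.shape)
```

In the original loop, the same per-row extraction would replace both the cy_data.append(...) line and the separate pass over pagetable.find_all("a"), so the length-41 check on the href is no longer needed.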