I'm trying to extract data from an HTML table that is somewhat unstructured. The HTML table structure is shown below (sample data):
I am able to extract the data, but I'm facing an issue with the "ID" column. "ID" is a single header spanning 2 columns, and that structure is not consistent throughout the table.
I'm running the code below:
# Libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

group_techniques = []
# Loop through the URLs we loaded above (base_url is the list of page URLs)
for b in base_url:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # Provide the class of the table we want to scrape
    group_table = soup.find('table', {"class": "table techniques-used table-bordered mt-2"})
    # Skip any URL with a missing/empty table
    if group_table is None:
        continue
    # Loop through the table, grab each of the 5 columns shown
    for row in group_table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 5:
            group_techniques.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(),
                                     cols[3].text.strip(), cols[4].text.strip()))
# Convert the collected rows to a dataframe and name the columns.
# Each row holds the URL (b) plus 5 cells, so 6 names are needed; 'URL' is an added name.
df_grp_tech = pd.DataFrame(group_techniques,
                           columns=['URL', 'Domain', 'Tech_ID', 'sub_id', 'Name', 'Use'])
When we compare the actual output with the expected output:
- Row #3 (T1560) from the raw HTML table is missing
- Row #5 (T1059) from the raw HTML table is missing
This is because of the complex table structure.
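A likely cause, assuming the rows without a sub-ID use a single `td` with `colspan="2"` for the ID, is that those rows contain only 4 `td` elements, so the `len(cols) == 5` check drops them. The sketch below pads each spanned cell so every data row ends up with 5 values. The sample markup and the helper name `expand_row` are invented for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration only (not copied from the real page).
SAMPLE_HTML = """
<table class="table techniques-used table-bordered mt-2">
  <tr><th>Domain</th><th colspan="2">ID</th><th>Name</th><th>Use</th></tr>
  <tr><td>Enterprise</td><td>T0001</td><td>.001</td><td>Name A</td><td>Use A</td></tr>
  <tr><td>Enterprise</td><td colspan="2">T0002</td><td>Name B</td><td>Use B</td></tr>
</table>
"""


def expand_row(row):
    """Return the texts of a row's td cells, padding cells that span several columns."""
    values = []
    for cell in row.find_all("td"):
        span = int(cell.get("colspan", 1))
        # Keep the text in the first spanned column and leave the others empty.
        values.append(cell.get_text(strip=True))
        values.extend([""] * (span - 1))
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "table techniques-used table-bordered mt-2"})
rows = [expand_row(tr) for tr in table.find_all("tr")]
# The header row has no td cells, so it yields an empty list and is filtered out here.
rows = [r for r in rows if len(r) == 5]
```

With this padding, a row whose ID spans both columns is kept, with an empty string in the sub-ID position.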
Actual table structure after extraction:
Expected table structure:
HTML table: here is the link, "Table - Techniques Used"
It's much simpler, in this case at least, to use pandas:
And that's it. The output is your target table.
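For reference, a minimal sketch of what such a pandas call might look like. It assumes the table can be selected by the class string from the question and that an HTML parser such as lxml is installed for `pd.read_html`. The sample markup and the function name `read_techniques_table` are invented for illustration; for a live page you would pass `requests.get(url).text` instead of the sample string.

```python
from io import StringIO

import pandas as pd

# Invented sample markup for illustration only (not copied from the real page).
SAMPLE_HTML = """
<table class="table techniques-used table-bordered mt-2">
  <tr><th>Domain</th><th colspan="2">ID</th><th>Name</th><th>Use</th></tr>
  <tr><td>Enterprise</td><td>T0001</td><td>.001</td><td>Name A</td><td>Use A</td></tr>
  <tr><td>Enterprise</td><td colspan="2">T0002</td><td>Name B</td><td>Use B</td></tr>
</table>
"""


def read_techniques_table(html):
    """Parse the first table carrying the class string used in the question."""
    tables = pd.read_html(
        StringIO(html),
        attrs={"class": "table techniques-used table-bordered mt-2"},
    )
    return tables[0]


df = read_techniques_table(SAMPLE_HTML)
print(df)
```

As far as I know, `read_html` expands `colspan` by repeating the cell value across the columns it spans, so rows with a single wide ID cell are kept rather than dropped. The second ID column can then be cleaned up or renamed as needed.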