I am dealing with very complex table elements in HTML. Multiple cells with colspan act as titles/sub-titles, and the 'td' cells (columns) and rows sometimes do not line up. How do I decompose the table into pure text while keeping it readable? Is there a module/library that can do the work for me?
Here's an example of the HTML table, but it could get more complex.
Here's the code I tried on the table. It is very ugly, yes, but it sometimes works.
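from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html: the page source as a string
output_html_content = ''

for table in soup.find_all('table'):
    try:
        # Split the table into sections, each opened by a cell that has a colspan.
        sections = []
        current_section_header = None
        current_section_rows = []
        for row in table.find_all('tr'):
            colspan_cell = row.find(['td', 'th'], colspan=True)
            if colspan_cell:
                if current_section_header is not None:
                    sections.append((current_section_header, current_section_rows))
                current_section_header = colspan_cell
                current_section_rows = []
            else:
                current_section_rows.append(row)
        if current_section_header is not None:
            sections.append((current_section_header, current_section_rows))

        # Emit each section: its title first, then the rows below it.
        for section_header, rows in sections:
            output_html_content += section_header.get_text(strip=True) + "<br><br>"
            if len(rows) == 1:
                for cell in rows[0].find_all(['td', 'th']):
                    output_html_content += cell.get_text(strip=True) + ' '
            elif rows:  # an empty section would otherwise fail on rows[0]
                # Treat the first row as column names for the remaining rows.
                column_names = [cell.get_text(strip=True) for cell in rows[0].find_all(['td', 'th'])]
                for row in rows[1:]:
                    cells = row.find_all(['td', 'th'])
                    # zip() stops at the shorter list, so unmatched cells are dropped.
                    for name, cell in zip(column_names, cells):
                        output_html_content += f"{name}: {cell.get_text(strip=True)}<br><br>"
            output_html_content += "<br><br>"
    except Exception as exc:
        # Skip tables that this heuristic cannot handle.
        print(f"Skipping table: {exc}")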
I tried to treat each colspan cell as a header/title, and then match the cell text against the first row after the colspan.
But when the data gets really complex, this doesn't work, because the rows and columns don't always match. The text in the p tags may also read from top to bottom.
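One direction that might cope with mismatched rows is to expand every colspan/rowspan into a rectangular grid first, and only then turn the grid into text. The following is a minimal sketch of that idea using only the standard library's html.parser. The names TableGrid, expand_table and table_to_text are made up for this illustration, and the sketch assumes a single non-nested table whose cells are properly closed with </td> or </th> and whose span attributes are plain integers.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Hypothetical helper: place each cell's text into every grid slot it spans."""

    def __init__(self):
        super().__init__()
        self.grid = {}      # (row, col) -> cell text
        self.row = -1       # index of the current <tr>
        self.col = 0        # next candidate column in the current row
        self.span = None    # (rowspan, colspan) while a cell is open
        self.parts = []     # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            a = dict(attrs)
            self.span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self.parts = []

    def handle_data(self, data):
        if self.span is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.span is not None:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (self.row, self.col) in self.grid:
                self.col += 1
            text = " ".join(" ".join(self.parts).split())
            rowspan, colspan = self.span
            for r in range(rowspan):
                for c in range(colspan):
                    self.grid[(self.row + r, self.col + c)] = text
            self.col += colspan
            self.span = None


def expand_table(html):
    """Return the table as a list of equal-length rows of strings."""
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    if not parser.grid:
        return []
    n_rows = 1 + max(r for r, _ in parser.grid)
    n_cols = 1 + max(c for _, c in parser.grid)
    return [[parser.grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


def table_to_text(html):
    """Render the expanded grid as aligned, pipe-separated plain text."""
    rows = expand_table(html)
    if not rows:
        return ""
    widths = [max(len(row[c]) for row in rows) for c in range(len(rows[0]))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


sample = (
    "<table>"
    "<tr><th colspan='2'>Title</th></tr>"
    "<tr><td rowspan='2'>A</td><td>1</td></tr>"
    "<tr><td>2</td></tr>"
    "</table>"
)
print(table_to_text(sample))
```

Repeating a spanned cell's text in every slot it covers is a design choice: it keeps each row self-describing once the table is flattened, at the cost of some duplication. Rows that are simply short are padded with empty strings, so every row ends up with the same number of columns.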