How do I extract text from HTML tables while preserving their layout? BS4/Python


I am dealing with very complex table elements in HTML. Multiple colspan cells act as titles/sub-titles, and the number of 'td' cells (columns) does not always match from row to row. How do I decompose the table into plain text while keeping it readable? Is there a module/library that can do this for me?

Here's an example of the HTML table, though it can get more complex:

Here's the code I tried on the table. It is very ugly, yes, but it sometimes works.

    from bs4 import BeautifulSoup

    # Assumes `html` already holds the page source as a string.
    soup = BeautifulSoup(html, 'html.parser')
    output_html_content = ""

    for table in soup.find_all('table'):
        sections = []
        current_section_header = None
        current_section_rows = []

        # A row containing a colspan cell starts a new section.
        for row in table.find_all('tr'):
            colspan_cell = row.find(['td', 'th'], colspan=True)
            if colspan_cell:
                if current_section_header is not None:
                    sections.append((current_section_header, current_section_rows))
                current_section_header = colspan_cell
                current_section_rows = []
            else:
                current_section_rows.append(row)

        # Flush the last open section.
        if current_section_header is not None:
            sections.append((current_section_header, current_section_rows))

        for header, rows in sections:
            output_html_content += header.get_text(strip=True) + "<br><br>"
            if len(rows) == 1:
                # A single row under the header: emit its cells on one line.
                for cell in rows[0].find_all(['td', 'th']):
                    output_html_content += cell.get_text(strip=True) + ' '
            elif rows:
                # The first row holds column names; pair them with each later row.
                column_names = [cell.get_text(strip=True) for cell in rows[0].find_all(['td', 'th'])]
                for row in rows[1:]:
                    cells = row.find_all(['td', 'th'])
                    # zip() stops at the shorter list, so unmatched cells are dropped.
                    for name, cell in zip(column_names, cells):
                        output_html_content += f"{name}: {cell.get_text(strip=True)}<br><br>"
            output_html_content += "<br><br>"

I tried to treat each colspan cell as a header/title, then match the cells of the following rows against the first row after the colspan cell.

But when the data gets really complex, this doesn't work because the rows and columns don't always line up. The text in p tags may also be meant to be read from top to bottom.
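
One possible way around the row/column mismatch described above is to first expand every colspan and rowspan into a rectangular grid, and only then render text. The following is a minimal sketch, not a tested solution for the tables in question. The names `table_to_grid`, `grid_to_text` and `SAMPLE_HTML` are made up for illustration, and it assumes there are no nested tables and that a row whose expanded cells are all identical is a title row (a heuristic that can misfire on real data).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a full-width title, a header row, and a rowspan cell.
SAMPLE_HTML = """
<table>
  <tr><th colspan="3">Sales</th></tr>
  <tr><th>Region</th><th>Q1</th><th>Q2</th></tr>
  <tr><td rowspan="2">North</td><td>10</td><td>20</td></tr>
  <tr><td>30</td><td>40</td></tr>
</table>
"""


def _span(cell, attr):
    """Read a colspan/rowspan attribute, falling back to 1 if missing or invalid."""
    try:
        return max(int(cell.get(attr, 1)), 1)
    except ValueError:
        return 1


def table_to_grid(table):
    """Expand spans so each row has one text entry per visual column."""
    grid = []
    carry = {}  # column index -> (rows still to fill, text) from rowspans above
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"], recursive=False)
        row, col, i = [], 0, 0
        while i < len(cells) or col in carry:
            if col in carry:
                # This column is occupied by a rowspan cell from an earlier row.
                remaining, text = carry[col]
                row.append(text)
                if remaining == 1:
                    del carry[col]
                else:
                    carry[col] = (remaining - 1, text)
                col += 1
                continue
            cell = cells[i]
            i += 1
            text = cell.get_text(" ", strip=True)
            rowspan, colspan = _span(cell, "rowspan"), _span(cell, "colspan")
            for _ in range(colspan):
                row.append(text)
                if rowspan > 1:
                    carry[col] = (rowspan - 1, text)
                col += 1
        grid.append(row)
    return grid


def _is_title(row):
    """Heuristic: several columns that all hold the same text."""
    return len(row) > 1 and len(set(row)) == 1


def grid_to_text(grid):
    """Render the grid as aligned plain text; title rows become a single line."""
    width = max((len(row) for row in grid), default=0)
    body = [row for row in grid if not _is_title(row)]
    col_widths = [
        max((len(row[c]) for row in body if c < len(row)), default=0)
        for c in range(width)
    ]
    lines = []
    for row in grid:
        if _is_title(row):
            lines.append(row[0])
        else:
            padded = row + [""] * (width - len(row))
            line = " | ".join(t.ljust(w) for t, w in zip(padded, col_widths))
            lines.append(line.rstrip())
    return "\n".join(lines)
```

Calling `grid_to_text(table_to_grid(soup.find("table")))` on a parsed page would then give one text line per table row. Because every row ends up with an entry for each visual column, pairing values with column names no longer depends on how many 'td' cells a given row happens to contain.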
