Can anyone help me understand this code (HTML table parsing in lxml, python)?


Background: I need to write an HTML table parser in Python for tables with varying colspans and rowspans. After some research I stumbled upon this gem. It works well for simple cases without wacky colspans and rowspans, but I've run into a bug. The code assumes that an element with a colspan of 3 belongs to three different table headers, when it really belongs only to the header that the center of the colspan falls under. An example can be seen at http://en.wiktionary.org/wiki/han#Swedish (open the declension table under the Swedish section). The code incorrectly reports that "hans" (possessive-neuter-3rd person masculine) also belongs to possessive-common-3rd person masculine and possessive-plural-3rd person masculine, because it has a colspan of 3.

I've tried adding a check to table_to_2d_dict that starts a counter whenever colspan > 1 and only counts the element as part of a header when the counter equals colspan // 2 + 1 (the median of range(1, colspan + 1), i.e. the position of the header the element should be counted under). However, when I implement this check in the location specified in the code below, it doesn't work. To be honest, this probably stems from my not understanding how this code works, so...
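To make the arithmetic behind that check concrete, here is a minimal sketch of the median calculation on its own. The helper name center_position is made up for illustration and is not part of the original code. Note that for an even colspan there is no single middle column, so the formula ends up picking the upper of the two candidates.

```python
def center_position(colspan):
    """1-based position of the middle column inside a span of `colspan` columns.

    Hypothetical helper: it only restates the colspan // 2 + 1 formula from the
    description above.
    """
    return colspan // 2 + 1


for colspan in (1, 3, 4, 5):
    positions = list(range(1, colspan + 1))
    print(colspan, positions, center_position(colspan))
# 1 [1] 1
# 3 [1, 2, 3] 2
# 4 [1, 2, 3, 4] 3
# 5 [1, 2, 3, 4, 5] 3
```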

Question: Can someone explain what this code does and why it malfunctions as described above? If someone can implement a fix, that'd be great, but right now I'm primarily concerned with understanding the code. Thanks.

Below is the code with comments that I've added to highlight parts of the code I understand and parts I don't.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict


def table_to_list(table):
    dct = table_to_2d_dict(table)
    return list(iter_2d_dict(dct))


def table_to_2d_dict(table):
    result = defaultdict(lambda : defaultdict(str))
    for row_i, row in enumerate(table.xpath('./tr')): #these double for loops iterate over each element in the table
        for col_i, col in enumerate(row.xpath('./td|./th')):
            colspan = int(col.get('colspan', 1)) #gets colspan attr of the element, if none assumes it's 1
            rowspan = int(col.get('rowspan', 1)) #gets rowspan attr of the element, if none assumes it's 1
            col_data = col.text_content() #gets raw text inside element

            #WHAT DOES THIS DO? :(
            while row_i in result and col_i in result[row_i]: 
                col_i += 1
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    result[i][j] = col_data
    return result

#what does this do? :(
def iter_2d_dict(dct):
    for i, row in sorted(dct.items()):
        cols = []
        for j, col in sorted(row.items()):
            cols.append(col)
        yield cols


if __name__ == '__main__':
    import lxml.html
    from pprint import pprint

    doc = lxml.html.parse('tables.html')
    for table_el in doc.xpath('//table'):
        table = table_to_list(table_el)
        pprint(table)
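
For anyone who wants to reproduce the behavior without lxml or the Wiktionary page, here is a minimal sketch of the same grid-filling loops. It assumes each cell is given as a hypothetical (text, colspan, rowspan) tuple instead of an lxml element, and the sample rows are a made-up miniature of the declension table, not the real one. It shows the two things that seem to matter: the while loop moves col_i past slots that an earlier rowspan or colspan has already filled, and a cell with colspan 3 has its text copied into all three columns.

```python
from collections import defaultdict


def fill_grid(rows):
    """Expand rows of (text, colspan, rowspan) tuples into a rectangular grid."""
    grid = defaultdict(dict)
    for row_i, row in enumerate(rows):
        for col_i, (text, colspan, rowspan) in enumerate(row):
            # Skip slots already claimed by an earlier cell's colspan/rowspan.
            while col_i in grid[row_i]:
                col_i += 1
            # Copy the text into every slot the cell spans.
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    grid[i][j] = text
    # Same job as iter_2d_dict: rows in order, columns in order.
    return [[text for _, text in sorted(cols.items())]
            for _, cols in sorted(grid.items())]


rows = [
    [("case", 1, 2), ("possessive", 3, 1)],
    [("common", 1, 1), ("neuter", 1, 1), ("plural", 1, 1)],
    [("3rd masc", 1, 1), ("hans", 3, 1)],
]
grid = fill_grid(rows)
for line in grid:
    print(line)
# ['case', 'possessive', 'possessive', 'possessive']
# ['case', 'common', 'neuter', 'plural']
# ['3rd masc', 'hans', 'hans', 'hans']
```

In this sketch "hans" lands under common, neuter and plural alike, which matches the symptom described in the background.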