Extract multiple HTML tables with HXT


My problem is that I have to extract all of the tables from an HTML document and put them in a list of tables.

Hence, I understand that the final function type should be

getTable :: a [XmlTree] [[String]]

For example, with the following XML:

<table class="t1">
<tr>
    <td>x</td>
    <td>y</td>
</tr>
<tr>
    <td>a</td>
    <td>b</td>
</tr>
</table>
<table class="t2">
<tr>
    <td>3</td>
    <td>5</td>
</tr>
<tr>
    <td>toto</td>
    <td>titi</td>
</tr>
</table>

I know how to retrieve all the rows from one XmlTree (example1), and how to retrieve all the "table" tags, which gives me the type [XmlTree], but I don't know how to map the arrow example1 over the result of test2.

I'm sure it's obvious, but I can't find it.

test2 :: IO [[XmlTree]]
test2 = runX $ parseXML "table.xml" >>> is "table" >>> listA getChildren

example1 :: ArrowXml a => a XmlTree [String]
example1 = is "table" /> listA (getChildren >>> is "td" /> getText)

There is 1 answer

shang (best answer)

Using the same general idea that you have in example1, we can write getTable like this:

getTable :: ArrowXml a => a XmlTree [[String]]
getTable = hasName "table" >>> listA (rows >>> listA cols)
  where
    rows = getChildren >>> hasName "tr"
    cols = getChildren >>> hasName "td" /> getText

Running the arrow on each table element of your example document (for example, with deep getTable) produces:

[[["x","y"],["a","b"]],[["3","5"],["toto","titi"]]]
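
For completeness, here is a minimal sketch of one way to run getTable end to end. It rests on a few assumptions that are not in the original post: the two tables are wrapped in a hypothetical <doc> root element so that the input is well-formed XML, the input is read from a string with readString rather than from table.xml, and deep is used to find every table in the tree. The names extractTables and sampleXml are made up for illustration.

```haskell
module Main where

import Text.XML.HXT.Core

-- The arrow from the answer: one [[String]] per <table> element,
-- with one inner list per <tr> and one String per <td>.
getTable :: ArrowXml a => a XmlTree [[String]]
getTable = hasName "table" >>> listA (rows >>> listA cols)
  where
    rows = getChildren >>> hasName "tr"
    cols = getChildren >>> hasName "td" /> getText

-- Hypothetical sample input: the question's two tables wrapped in a
-- made-up <doc> root, because XML allows only one root element.
sampleXml :: String
sampleXml = concat
  [ "<doc>"
  , "<table class=\"t1\"><tr><td>x</td><td>y</td></tr><tr><td>a</td><td>b</td></tr></table>"
  , "<table class=\"t2\"><tr><td>3</td><td>5</td></tr><tr><td>toto</td><td>titi</td></tr></table>"
  , "</doc>"
  ]

-- Hypothetical helper: parse the string without DTD validation, then
-- apply getTable to every <table> found anywhere in the tree.
-- runX collects one result per matched table.
extractTables :: String -> IO [[[String]]]
extractTables src =
  runX (readString [withValidate no] src >>> deep getTable)

main :: IO ()
main = extractTables sampleXml >>= print
```

Because hasName "table" only tests the current node, getTable yields nothing when applied directly to the document root; deep walks down the tree until it reaches each table. For this sample input, the printed value should match the nested list shown above. For real-world HTML, passing withParseHTML yes to the reader may be more forgiving than the XML parser, though that variant is not exercised here.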