I am using xml-conduit and Text.XML.Cursor to navigate some terrible HTML with nested tables. There is a table with two tbody tags, and I want the immediate child tr tags of the first tbody. Here is my code so far:
getIdentityTableBody :: Cursor -> [Cursor]
getIdentityTableBody
    = element "table" >=> hasAttribute "summary" >=>
      attributeIs "summary" "Issuer Identity Information"
      &// element "tbody" >=> child >=> element "tr"
But this gets all the descendant trs of both tbody tags. I simply don't know how to get the first tbody alone, and am confused about filtering only for immediate children in that tbody.
Here is the html I am trying to parse.
<table summary="Issuer Identity Information" width="100%">
<tbody>
<tr>
<th width="33%" class="FormText">CIK (Filer ID Number)</th>
<th width="10%" class="FormText">Previous Names</th>
<td width="23%">
<table border="0" summary="Table with single CheckBox">
<tbody><tr>
<td class="CheckBox"><span class="FormData">X</span></td>
<td align="left" class="FormText">None</td>
</tr>
</tbody></table>
</td>
<th width="33%" class="FormText">Entity Type</th>
</tr>
<tr>
<td>
<a href="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001614286">0001614286</a>
</td>
<td rowspan="5" colspan="2" valign="top"></td>
<td rowspan="7" valign="top">
<table width="100%" border="0" summary="Table with Multiple boxes">
<tbody><tr>
<td class="CheckBox"> </td>
<td class="FormText">Corporation</td>
</tr>
<tr>
<td class="CheckBox"><span class="FormData">X</span></td>
<td class="FormText">Limited Partnership</td>
</tr>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">Limited Liability Company</td>
</tr>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">General Partnership</td>
</tr>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">Business Trust</td>
</tr>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">Other (Specify)</td>
</tr>
</tbody></table>
<br>
</td>
</tr>
<tr>
<th class="FormText">Name of Issuer</th>
</tr>
<tr>
<td class="FormData">SRA US Equity Fund, LP</td>
</tr>
<tr>
<th class="FormText">Jurisdiction of Incorporation/Organization</th>
</tr>
<tr>
<td class="FormData">DELAWARE</td>
</tr>
<tr>
<th class="FormText" colspan="2">Year of Incorporation/Organization</th>
</tr>
<tr>
<td colspan="3">
<table border="0" summary="Year of Incorporation/Organization">
<tbody>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">Over Five Years Ago</td>
</tr>
<tr>
<td class="CheckBox"><span class="FormData">X</span></td>
<td class="FormText">Within Last Five Years (Specify Year)</td>
<td><span class="FormData">2014</span></td>
</tr>
<tr>
<td class="CheckBox"> </td>
<td class="FormText">Yet to Be Formed</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
The issue is that
&// element "tbody"
says "find every single tbody descendant", including tbody tags that are nested inside other tbody tags. What about using &/ instead, which gets just the direct tbody children of the table element?

Two other comments:

You don't need both hasAttribute and attributeIs. Just confirming that the attribute has the given value will also check that it exists.
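To make the difference between &// and &/ concrete, here is a minimal sketch. The name identityRows, the comparison function allNestedRows, and the tiny sample document are made up for illustration, and the sample is well-formed XML parsed with Text.XML rather than the real page, so treat it as a starting point rather than a drop-in fix.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad ((>=>))
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor
  (Cursor, attributeIs, child, element, fromDocument, ($//), (&/), (&//))

-- Direct children only: table -> tbody -> tr.
-- attributeIs alone is enough; it fails when the attribute is missing.
identityRows :: Cursor -> [Cursor]
identityRows =
  element "table" >=> attributeIs "summary" "Issuer Identity Information"
    &/ element "tbody" &/ element "tr"

-- The behaviour from the question, kept for comparison:
-- &// also reaches tbody elements of nested tables.
allNestedRows :: Cursor -> [Cursor]
allNestedRows =
  element "table" >=> attributeIs "summary" "Issuer Identity Information"
    &// element "tbody" >=> child >=> element "tr"

-- Hypothetical, heavily reduced stand-in for the real page:
-- an outer table with two rows, the first of which holds a nested table.
sample :: TL.Text
sample = TL.concat
  [ "<div><table summary=\"Issuer Identity Information\"><tbody>"
  , "<tr><td><table summary=\"inner\"><tbody>"
  , "<tr><td>nested</td></tr>"
  , "</tbody></table></td></tr>"
  , "<tr><td>outer</td></tr>"
  , "</tbody></table></div>"
  ]

-- (rows found with &/, rows found with &//)
rowCounts :: (Int, Int)
rowCounts =
  let root = fromDocument (parseText_ def sample)
  in (length (root $// identityRows), length (root $// allNestedRows))

main :: IO ()
main = print rowCounts
```

On this sample the first count covers only the two rows of the outer tbody, while the second also picks up the row of the nested table. If the real table turns out to have more than one direct tbody child, taking only the first would need an extra step that this sketch does not cover.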