I'm looking to extract records from a very well-formed HTML table using HXT. I've reviewed a couple of examples on SO and the HXT documentation, such as:
- Extracting Values from a Subtree
- http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html
- https://www.schoolofhaskell.com/school/advanced-haskell/xml-parsing-with-validation
- Running Haskell HXT outside of IO?
- extract multiples html tables with hxt
- Parsing html in haskell
- http://neilbartlett.name/blog/2007/08/01/haskell-explaining-arrows-through-xml-transformationa/
- https://wiki.haskell.org/HXT/Practical/Simple2
- https://wiki.haskell.org/HXT/Practical/Simple1
- Group html table rows with HXT in Haskell
- Parsing multiple child nodes in Haskell with HXT
My problem is:
I want to identify a table uniquely by a known id, and then for each tr within that table, create a record object and return this as a list of records.
Here's my HTML
<!DOCTYPE html>
<html>
<head>
<title>FakeHTML</title>
</head>
<body>
<table id="fakeout-dont-get-me">
<thead><tr><td>Null</td></tr></thead>
<tbody><tr><td>Junk!</td></tr></tbody>
</table>
<table id="Greatest-Table">
<thead>
<tr><td>Name</td><td>Favorite Rock</td></tr>
</thead>
<tbody>
<tr id="rock1">
<td>Fred</td>
<td>Igneous</td>
</tr>
<tr id="rock2">
<td>Bill</td>
<td>Sedimentary</td>
</tr>
</tbody>
</table>
</body>
</html>
Here's the code I'm trying, along with 2 different approaches to parsing this. First, imports ...
{-# LANGUAGE Arrows, OverloadedStrings, DeriveDataTypeable, FlexibleContexts #-}
import Text.XML.HXT.Core
import Text.HandsomeSoup
import Text.XML.HXT.XPath.XPathEval
import Data.Tree.NTree.TypeDefs
import Text.XML.HXT.XPath.Arrows
What I want is a list of Rockrecs, e.g. from...
recs = [("rock1", "Name", "Fred", "Favorite Rock", "Igneous"),
        ("rock2", "Name", "Bill", "Favorite Rock", "Sedimentary")]
data Rockrec = Rockrec { rockID :: String,
                         rockName :: String,
                         rockFav :: String } deriving Show
rocks = [(\(a,_,b,_,c) -> Rockrec a b c ) r | r <- recs]
-- [Rockrec {rockID = "rock1", rockName = "Fred", rockFav = "Igneous"},
-- Rockrec {rockID = "rock2", rockName = "Bill", rockFav = "Sedimentary"}]
Here's my first way, which uses a bind on runLA after I return a bunch of [XmlTree]. That is, I do a first parse just to get the right table, then I process the tree rows after that first grab.
Attempt 1
getTab = do
  dt <- Prelude.readFile "fake.html"
  let html = parseHtml dt
  tab <- runX $ html //> hasAttrValue "id" (== "Greatest-Table")
  return tab
-- hmm, now this gets tricky...
-- table <- getTab
node tag = multi (hasName tag)
-- a la https://stackoverflow.com/questions/3901492/running-haskell-hxt-outside-of-io?rq=1
getIt :: ArrowXml cat => cat (Data.Tree.NTree.TypeDefs.NTree XNode) (String, String)
getIt = (node "tr" >>>
         (getAttrValue "id" &&& (node "td" //> getText)))
This kinda works. I need to massage a bit, but can get it to run...
-- table >>= runLA getIt
-- [("","Name"),("","Favorite Rock"),("rock1","Fred"),("rock1","Igneous"),("rock2","Bill"),("rock2","Sedimentary")]
This is a second approach, inspired by https://wiki.haskell.org/HXT/Practical/Simple1. Here, I think I'm relying on something in {-# LANGUAGE Arrows #-} (which coincidentally breaks my list comprehension for rec above) to use proc notation and do this in a more readable do block. That said, I can't even get a minimal version of this to compile:
Attempt 2
getR :: ArrowXml cat => cat XmlTree Rockrec
getR = (hasAttrValue "id" (== "Greatest-Table")) >>>
  proc x -> do
    rockId <- getText -< x
    rockName <- getText -< x
    rockFav <- getText -< x
    returnA -< Rockrec rockId rockName rockFav
EDIT
Trouble with the types, in response to the comment below from Alec
λ> getR [table]
<interactive>:56:1-12: error:
• Couldn't match type ‘NTree XNode’ with ‘[[XmlTree]]’
Expected type: [[XmlTree]] -> Rockrec
Actual type: XmlTree -> Rockrec
• The function ‘getR’ is applied to one argument,
its type is ‘cat0 XmlTree Rockrec’,
it is specialized to ‘XmlTree -> Rockrec’
In the expression: getR [table]
In an equation for ‘it’: it = getR [table]
λ> getR table
<interactive>:57:1-10: error:
• Couldn't match type ‘NTree XNode’ with ‘[XmlTree]’
Expected type: [XmlTree] -> Rockrec
Actual type: XmlTree -> Rockrec
• The function ‘getR’ is applied to one argument,
its type is ‘cat0 XmlTree Rockrec’,
it is specialized to ‘XmlTree -> Rockrec’
In the expression: getR table
In an equation for ‘it’: it = getR table
END EDIT
Even if I'm not selecting elements, I can't get the above to run. I'm also a little puzzled about how to put the first td in rockName and the second td in rockFav, and how to iterate over these (supposing I have a lot of td fields instead of just 2).
Any further general tips on how to do this more painlessly appreciated.
From HXT/Practical/Google1 I think I am able to piece together a solution.
The key takeaways are these:
- Your arrows work on streams of elements, not on individual elements. This is the ArrowList constraint. Thus, calling getText three times will produce surprising behavior, because getText represents all the different possible text values you could get in the course of streaming <table> elements through your proc x -> do {...}.
- What we can do instead is focus on the stream we want: a stream of <tr>s inside the <tbody>. For each table row, we grab the ID attribute value and the text of the first two <td>s.
- This does not seem the most elegant solution, but one way we can index into a stream is to filter it down with the (>.) :: ArrowList cat => cat a b -> ([b] -> c) -> cat a c combinator.
- One last trick, one that I noticed in the practical wiki examples: we can use deep and isElem/isText to focus on just the nodes we want. XML trees are noisy!
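To make these takeaways concrete, here is a minimal sketch of how they might fit together. The helper names tdText and rocksFromTable and the inline sampleHtml string are assumptions for illustration, not part of the question's code. The sketch also parses with xread (HXT's pure XML content parser) only because the sample markup is well formed; for real-world HTML you would more likely keep parseHtml from HandsomeSoup, as in the question, and run the arrow with runX.

```haskell
{-# LANGUAGE Arrows #-}
import Control.Arrow (returnA)
import Text.XML.HXT.Core

data Rockrec = Rockrec
  { rockID   :: String
  , rockName :: String
  , rockFav  :: String
  } deriving (Show, Eq)

-- Hypothetical sample input: a compact copy of the question's HTML.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<table id=\"fakeout-dont-get-me\">"
  , "<thead><tr><td>Null</td></tr></thead>"
  , "<tbody><tr><td>Junk!</td></tr></tbody>"
  , "</table>"
  , "<table id=\"Greatest-Table\">"
  , "<thead><tr><td>Name</td><td>Favorite Rock</td></tr></thead>"
  , "<tbody>"
  , "<tr id=\"rock1\"><td>Fred</td><td>Igneous</td></tr>"
  , "<tr id=\"rock2\"><td>Bill</td><td>Sedimentary</td></tr>"
  , "</tbody>"
  , "</table>"
  , "</body></html>"
  ]

-- Hypothetical helper: text of the n-th <td> child of the current <tr>.
-- (>.) collapses the stream of <td>s into a single value, here by indexing.
-- Note that (!! n) is partial: a row with too few cells raises an error.
tdText :: ArrowXml a => Int -> a XmlTree String
tdText n = (getChildren >>> hasName "td") >. (!! n) >>> deep isText >>> getText

-- Hypothetical helper: one Rockrec per <tr> in the <tbody> of the table
-- with the given id. Rows in <thead> are skipped because we only descend
-- through <tbody>.
rocksFromTable :: ArrowXml a => String -> a XmlTree Rockrec
rocksFromTable tableId =
  deep (isElem >>> hasName "table" >>> hasAttrValue "id" (== tableId))
    >>> getChildren >>> hasName "tbody"
    >>> getChildren >>> hasName "tr"
    >>> proc row -> do
          rid  <- getAttrValue "id" -< row
          name <- tdText 0          -< row
          fav  <- tdText 1          -< row
          returnA -< Rockrec rid name fav

main :: IO ()
main = mapM_ print (runLA (xread >>> rocksFromTable "Greatest-Table") sampleHtml)
```

Inside the proc block each command yields exactly one value per row, so the implicit cross product of the three streams stays at one record per <tr>. If rows may have fewer cells than expected, swapping (>. (!! n)) for (>>. (take 1 . drop n)) keeps the helper total and simply drops such rows.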