Records from <tr>s in an HTML table using Arrows and HXT in Haskell

I'm looking to extract records from a very well-formed HTML table using HXT. I've reviewed a couple of examples on SO and the HXT documentation.

My problem is:

I want to identify a table uniquely by a known id, and then for each tr within that table, create a record object and return this as a list of records.

Here's my HTML

<!DOCTYPE html>
<html>
<head>
  <title>FakeHTML</title>
</head>
<body>
  <table id="fakeout-dont-get-me">
    <thead><tr><td>Null</td></tr></thead>
    <tbody><tr><td>Junk!</td></tr></tbody>
  </table>
  <table id="Greatest-Table">
    <thead>
      <tr><td>Name</td><td>Favorite Rock</td></tr>
    </thead>
    <tbody>
      <tr id="rock1">
        <td>Fred</td>
        <td>Igneous</td>
      </tr>
      <tr id="rock2">
        <td>Bill</td>
        <td>Sedimentary</td>
      </tr>
    </tbody>
  </table>
</body>
</html>

Here's the code I'm trying, along with 2 different approaches to parsing this. First, imports ...

{-# LANGUAGE Arrows, OverloadedStrings, DeriveDataTypeable, FlexibleContexts  #-}
import Text.XML.HXT.Core
import Text.HandsomeSoup
import Text.XML.HXT.XPath.XPathEval
import Data.Tree.NTree.TypeDefs
import Text.XML.HXT.XPath.Arrows

What I want is a list of Rockrecs, e.g. built from tuples like these:

recs = [("rock1", "Name", "Fred", "Favorite Rock", "Igneous"),
        ("rock2", "Name", "Bill", "Favorite Rock", "Sedimentary")]

data Rockrec = Rockrec { rockID :: String,
                         rockName :: String,
                         rockFav :: String } deriving Show

rocks = [(\(a,_,b,_,c) -> Rockrec a b c ) r | r <- recs]
-- [Rockrec {rockID = "rock1", rockName = "Fred", rockFav = "Igneous"},
--  Rockrec {rockID = "rock2", rockName = "Bill", rockFav = "Sedimentary"}]

Here's my first way, which uses a list bind on runLA after I get back a list of trees ([XmlTree]). That is, I do a first parse just to get the right table, then process the table rows after that first grab.

Attempt 1

getTab = do
  dt <- Prelude.readFile "fake.html"
  let html = parseHtml dt
  tab <- runX $ html //> hasAttrValue "id" (== "Greatest-Table")
  return tab
  -- hmm, now this gets tricky...

-- table <- getTab

node tag = multi (hasName tag)

-- a la https://stackoverflow.com/questions/3901492/running-haskell-hxt-outside-of-io?rq=1
getIt  :: ArrowXml cat => cat (Data.Tree.NTree.TypeDefs.NTree XNode) (String, String)
getIt = (node "tr" >>>
         (getAttrValue "id" &&& (node "td" //> getText)))

This kinda works. I need to massage a bit, but can get it to run...

-- table >>= runLA getIt
-- [("","Name"),("","Favorite Rock"),("rock1","Fred"),("rock1","Igneous"),("rock2","Bill"),("rock2","Sedimentary")]
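For the "massage a bit" step, one possible sketch is to group the flat (id, text) pairs by row id and drop the header pairs, whose id is empty. This assumes the pairs arrive in document order and that every data row has exactly two cells; `pairs` and `toRockrecs` are names made up for illustration, and `pairs` is simply the output shown above pasted in as a literal.

```haskell
module Main where

import Data.Function (on)
import Data.List (groupBy)

data Rockrec = Rockrec
  { rockID   :: String
  , rockName :: String
  , rockFav  :: String
  } deriving (Show, Eq)

-- The flat output of `table >>= runLA getIt`, copied from above.
pairs :: [(String, String)]
pairs =
  [ ("", "Name"), ("", "Favorite Rock")
  , ("rock1", "Fred"), ("rock1", "Igneous")
  , ("rock2", "Bill"), ("rock2", "Sedimentary")
  ]

-- Group adjacent pairs that share a row id, skip the header group
-- (empty id), and keep only groups with exactly two cell texts.
toRockrecs :: [(String, String)] -> [Rockrec]
toRockrecs ps =
  [ Rockrec rid name fav
  | grp@((rid, _) : _) <- groupBy ((==) `on` fst) ps
  , not (null rid)
  , [name, fav] <- [map snd grp]
  ]

main :: IO ()
main = print (toRockrecs pairs)
```

This works on the pairs after the fact, so it does not need the Arrows extension at all; the cost is that rows without an id would be merged with the header group and dropped.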

This is a second approach, inspired by https://wiki.haskell.org/HXT/Practical/Simple1. Here, I think I'm relying on something in {-# LANGUAGE Arrows #-} (which coincidentally breaks my list comprehension for rec above) to use the proc notation and do this in a more readable do block. That said, I can't even get a minimal version of this to compile:

Attempt 2

getR :: ArrowXml cat => cat XmlTree Rockrec
getR = (hasAttrValue "id" (== "Greatest-Table")) >>>
  proc x -> do
    rockId <- getText -< x
    rockName <- getText -< x
    rockFav <- getText -< x
    returnA -< Rockrec rockId rockName rockFav

EDIT

Trouble with the types, in response to the comment below from Alec

λ> getR [table]

<interactive>:56:1-12: error:
    • Couldn't match type ‘NTree XNode’ with ‘[[XmlTree]]’
      Expected type: [[XmlTree]] -> Rockrec
        Actual type: XmlTree -> Rockrec
    • The function ‘getR’ is applied to one argument,
      its type is ‘cat0 XmlTree Rockrec’,
      it is specialized to ‘XmlTree -> Rockrec’
      In the expression: getR [table]
      In an equation for ‘it’: it = getR [table]
λ> getR table

<interactive>:57:1-10: error:
    • Couldn't match type ‘NTree XNode’ with ‘[XmlTree]’
      Expected type: [XmlTree] -> Rockrec
        Actual type: XmlTree -> Rockrec
    • The function ‘getR’ is applied to one argument,
      its type is ‘cat0 XmlTree Rockrec’,
      it is specialized to ‘XmlTree -> Rockrec’
      In the expression: getR table
      In an equation for ‘it’: it = getR table

END EDIT
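Both errors above come from applying getR as if it were a plain function. An arrow has to be run, for example with runLA, which takes a single XmlTree, so a list of trees needs a concatMap (or the list bind already used for getIt). A minimal sketch, assuming a hand-written one-row fragment parsed with xread instead of the real table; `rowPairs` is a made-up name and mirrors the shape of getIt:

```haskell
module Main where

import Text.XML.HXT.Core

-- Same shape as getIt: pair the row id with each cell text.
rowPairs :: ArrowXml a => a XmlTree (String, String)
rowPairs = getAttrValue "id" &&& (getChildren >>> hasName "td" /> getText)

main :: IO ()
main = do
  -- xread parses an XML fragment purely; the result is a list of trees,
  -- like the `table` value obtained from runX in the question.
  let trees = runLA xread "<tr id=\"rock1\"><td>Fred</td><td>Igneous</td></tr>"
  -- Run the arrow on every tree and concatenate the results.
  print (concatMap (runLA rowPairs) trees)
```

With the question's own names, the equivalent call would be `table >>= runLA getR` rather than `getR table`.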

Even if I'm not selecting elements, I can't get the above to run. I'm also a little puzzled about how to put the first td in rockName and the second td in rockFav, and how to iterate over the cells if there are many td fields instead of just 2.

Any further general tips on how to do this more painlessly appreciated.

There is 1 answer

Answer by hao (best answer)

From HXT/Practical/Google1 I think I am able to piece together a solution.

{-# LANGUAGE Arrows #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Hanzo where
import Text.HandsomeSoup
import Text.XML.HXT.Core

atTag tag =
  deep (isElem >>> hasName tag)
text =
  deep isText >>> getText

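-- One record per table row: the id attribute, then the two cell texts.
data Rock = Rock String String String deriving Show
rocks =
  atTag "tbody" //> atTag "tr"
  >>> proc x -> do
        rowID <- getAttrValue "id" -< x
        -- (>.) collects all <td>s of the row into a list; (!! n) picks one.
        name <- atTag "td" >. (!! 0) >>> text -< x
        kind <- atTag "td" >. (!! 1) >>> text -< x
        returnA -< Rock rowID name kind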

main = do
  dt <- readFile "html.html"
  result <- runX $ parseHtml dt
                   //> hasAttrValue "id" (== "Greatest-Table")
                   >>> rocks
  print result

The key takeaways are these:

  • Your arrows work on streams of elements, but not individual elements. This is the ArrowList constraint. Thus, calling getText three times will produce surprising behavior because getText represents all the different possible text values you could get in the course of streaming <table> elements through your proc x -> do {...}.

  • What we can do instead is focus on the stream we want: a stream of <tr>s inside the <tbody>. For each table row, we grab the ID attribute value and the text of the first two <td>s.

  • This does not seem the most elegant solution, but one way we can index into a stream is to filter it down with the (>.) :: ArrowList cat => cat a b -> ([b] -> c) -> cat a c combinator.

  • One last trick, one that I noticed in the practical wiki examples: we can use deep and isElem/isText to focus on just the nodes we want. XML trees are noisy!
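On the question of rows with many td fields: instead of indexing with (!! n) once per column, one alternative is to collect every cell of a row with listA and pattern match on the resulting list. The following is only a sketch under some assumptions: it uses HXT's own readString with withParseHTML instead of HandsomeSoup's parseHtml, it parses an inline string (`sampleHtml`, a shortened copy of the question's HTML) instead of a file, and `rowCells` and `toRock` are names made up for illustration.

```haskell
{-# LANGUAGE Arrows #-}
module Main where

import Text.XML.HXT.Core

data Rock = Rock String String String deriving (Show, Eq)

-- Inline stand-in for the question's file, so the sketch runs on its own.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<table id=\"fakeout-dont-get-me\"><tbody><tr><td>Junk!</td></tr></tbody></table>"
  , "<table id=\"Greatest-Table\">"
  , "<thead><tr><td>Name</td><td>Favorite Rock</td></tr></thead>"
  , "<tbody>"
  , "<tr id=\"rock1\"><td>Fred</td><td>Igneous</td></tr>"
  , "<tr id=\"rock2\"><td>Bill</td><td>Sedimentary</td></tr>"
  , "</tbody></table></body></html>"
  ]

atTag :: ArrowXml a => String -> a XmlTree XmlTree
atTag tag = deep (isElem >>> hasName tag)

text :: ArrowXml a => a XmlTree String
text = deep isText >>> getText

-- One (id, cell texts) pair per <tr>; listA gathers all <td>s of the row.
rowCells :: ArrowXml a => a XmlTree (String, [String])
rowCells = proc row -> do
  rowID <- getAttrValue "id" -< row
  cells <- listA (atTag "td" >>> text) -< row
  returnA -< (rowID, cells)

-- Rows that do not have exactly two cells produce no record.
toRock :: (String, [String]) -> [Rock]
toRock (rowID, [name, kind]) = [Rock rowID name kind]
toRock _                     = []

rocks :: ArrowXml a => a XmlTree Rock
rocks =
  deep (hasAttrValue "id" (== "Greatest-Table"))
    >>> atTag "tbody" //> atTag "tr"
    >>> rowCells
    >>> arrL toRock

main :: IO ()
main = do
  result <- runX (readString [withParseHTML yes, withWarnings no] sampleHtml >>> rocks)
  print result
```

Compared with (!! n), a row with a missing cell is skipped here instead of raising an index error, and adding a column only means extending the pattern in toRock.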