Ignoring subtrees in streaming xml-conduit

283 views Asked by At

The xml-conduit documentation only lists examples where the entire XML tree is consumed by a ConduitM, for example:

<people>
    <person age="25">Michael</person>
    <person age="2">Eliezer</person>
</people>

I'm trying to parse a tree where besides the <person> tags from above, there are deeply nested subtrees I'm not interested in (their exact schema might even be unknown), for example:

<people>
    <person age="25">Michael</person>
    <tagImNotInterestedIn><!-- deeply nested complex subtree --></tagImNotInterestedIn>
    <person age="2">Eliezer</person>
</people>

When parsing with the people.hs example from the docs, I get the following exception:

people.hs: XmlException {xmlErrorMessage = "Expected end tag for: Name {nameLocalName = \"people\", nameNamespace = Nothing, namePrefix = Nothing}", xmlBadInput = Just (EventBeginElement (Name {nameLocalName = "tagImNotInterestedIn", nameNamespace = Nothing, namePrefix = Nothing}) [])}

Basically, I'm looking for a way to ignore any tag (including all its children and attributes) except specific ones I specify parsers for. When using DOM-based parsers like HXT, this is obviously easy, but the tag docs explicitly states that it will fail unless all children are consumed.

The only hypothetical way I can think of accomplishing this is to use functions from Control.Exception to build up a Conduit with a Maybe a result (returning Nothing on exception) and then use orE to combine it with the parsers itself

Although it has been stated that the xml-conduit API needs some updating, I think there has to be a less-hackish way to ignore an entire subtree. Any ideas will be appreciated!

1

There are 1 answers

0
palik On

Since 1.5.0 Text.XML.Stream.Parse provides a function takeTree, which probably could be used for the purpose.

{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad                (void)
import           Control.Monad.Trans.Class    (lift)
import           Control.Monad.Trans.Resource (MonadThrow, runResourceT)
import           Data.ByteString.Lazy         (ByteString)
import           Data.ByteString.Lazy.Char8   (concat)
import           Data.Conduit                 (ConduitT, runConduit, (.|))
import           Data.Conduit.List            (mapM_)
import           Data.Text                    (Text, unpack)
import           Data.XML.Types               (Event)
import           Prelude                      hiding (concat, mapM_)
import           Text.XML.Stream.Parse        (choose, content, def,
                                               ignoreAnyTreeContent,
                                               ignoreAttrs, manyYield, many_,
                                               parseLBS, requireAttr, tag',
                                               tagNoAttr, takeTree)

data Person = Person Int Text deriving Show

parsePerson :: MonadThrow m => ConduitT Event o m (Maybe Person)
parsePerson = tag' "person" (requireAttr "age") $ \age -> do
    name <- content
    return $ Person (read $ unpack age) name

parsePeople :: MonadThrow m => ConduitT Event Person m ()
parsePeople = void $ tagNoAttr "people" $
  many_ (choose([takeTree "person" ignoreAttrs, ignoreAnyTreeContent])) .| manyYield parsePerson

persons :: ByteString
persons = concat [
    "<people>"
  , "<foo/>"
  , "<person age=\"25\">Michael</person>"
  , "<bar/>"
  , "<person age=\"2\">Eliezer</person>"
  , "<tagImNotInterestedIn>x</tagImNotInterestedIn>"
  , "</people>"

main :: IO ()
main = runResourceT $
  runConduit $ parseLBS def persons .| parsePeople .| mapM_ (lift . print)

The code above is based on xml-conduit sample. Only parsePeople is changed.

λ> main
Person 25 "Michael"
Person 2 "Eliezer"