Case-insensitive tag matching with xml-conduit?


What's the best way to perform case-insensitive tag and attribute name matching using xml-conduit?

For example, consider the findNodes function from the HTML parsing example on FP Complete's School of Haskell:

https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/tagsoup

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "class" "sb_count" >=> child

(I've modified this line so that it works with Bing's current page structure.)

My experiments indicate that element and attributeIs do not perform case-insensitive comparisons when matching names. Is there an easy way to change this?


There are 2 answers

1
Michael Snoyman (best answer)

You can use laxElement to ignore case when matching elements. It will also ignore namespaces. It should be pretty easy to write a wrapper around checkName that has the exact semantics you're looking for.
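A minimal sketch of such a wrapper, assuming a recent xml-conduit in which elementAttributes is a Map Name Text. The names sameNameCI, ciElement, and ciAttributeIs are hypothetical helpers, not part of the library. Unlike laxElement, they still compare namespaces exactly and only fold the case of the local name. (For reading attribute values, the library's own laxAttribute ignores both case and namespace.)

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Monad   ((>=>))
import qualified Data.Map        as Map
import           Data.Text       (Text)
import qualified Data.Text       as T
import           Text.XML        (Element (..), Name (..), def, parseText_)
import           Text.XML.Cursor (Axis, Cursor, checkElement, checkName,
                                  child, content, fromDocument, ($//))

-- Hypothetical helper: same namespace, local names equal up to case.
sameNameCI :: Name -> Name -> Bool
sameNameCI a b =
    nameNamespace a == nameNamespace b
        && T.toCaseFold (nameLocalName a) == T.toCaseFold (nameLocalName b)

-- Case-insensitive counterpart of 'element', built on 'checkName'.
ciElement :: Name -> Axis
ciElement n = checkName (sameNameCI n)

-- Case-insensitive counterpart of 'attributeIs', built on 'checkElement'.
-- Only the attribute name is matched loosely; the value must match exactly.
ciAttributeIs :: Name -> Text -> Axis
ciAttributeIs n v = checkElement $ \e ->
    any (\(k, v') -> sameNameCI n k && v' == v)
        (Map.toList (elementAttributes e))

-- The query from the question, rewritten with the helpers above.
findNodes :: Cursor -> [Cursor]
findNodes = ciElement "span" >=> ciAttributeIs "class" "sb_count" >=> child

-- Usage sketch: run the query over a tiny document with mixed-case names.
example :: [Text]
example = fromDocument doc $// findNodes >=> content
  where
    doc = parseText_ def "<root><SPAN Class=\"sb_count\">42</SPAN></root>"
```

Comparing with T.toCaseFold rather than T.toLower is a design choice here; for ASCII tag names the two behave the same.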

0
ErikR

I've found a work-around... still interested in a cleaner solution.

Basically, we create our own version of Text.HTML.DOM that lowercases the tag and attribute names in the token stream just before the XML tree is built.

The function eventConduit begins like this:

eventConduit :: Monad m => Conduit S.ByteString m XT.Event
eventConduit =
    TS.tokenStream =$= go []
  where
    go stack = do
        mx <- await
        case fmap (entities . fmap' (decodeUtf8With lenientDecode)) mx of
            Nothing -> closeStack stack
...

We change the case fmap ... line to:

        case fmap (entities . fixNames . fmap' (decodeUtf8With lenientDecode)) mx of

where fixNames is defined as:

fixNames :: TS.Token' Text -> TS.Token' Text
fixNames (TS.TagOpen x pairs b) = TS.TagOpen (T.toLower x) (map (T.toLower *** id) pairs) b
fixNames (TS.TagClose x)        = TS.TagClose (T.toLower x)
fixNames t                      = t

Now we just use lowercase names in element and attributeIs.
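A different route to the same end result, sketched here as an assumption rather than as part of the work-around above: leave Text.HTML.DOM untouched and lowercase the names in the parsed Document before building a cursor. The functions lowerName, lowerElement, lowerNode, and lowerNames are hypothetical helpers, and the sketch assumes elementAttributes is a Map Name Text. Because it runs after parsing, it cannot change how the parser paired open and close tags, which the token-level fixNames can.

```haskell
import qualified Data.Map  as Map
import qualified Data.Text as T
import           Text.XML  (Document (..), Element (..), Name (..), Node (..))

-- Lowercase the local part of a name; namespace and prefix are kept as-is.
lowerName :: Name -> Name
lowerName n = n { nameLocalName = T.toLower (nameLocalName n) }

-- Lowercase the tag name and attribute names, then recurse into children.
-- Attributes whose names differ only by case collapse into a single entry.
lowerElement :: Element -> Element
lowerElement (Element name attrs nodes) =
    Element (lowerName name)
            (Map.mapKeys lowerName attrs)
            (map lowerNode nodes)

-- Only element nodes carry names; other node types pass through unchanged.
lowerNode :: Node -> Node
lowerNode (NodeElement e) = NodeElement (lowerElement e)
lowerNode n               = n

-- Apply the rewrite to the whole document.
lowerNames :: Document -> Document
lowerNames doc = doc { documentRoot = lowerElement (documentRoot doc) }
```

With this in place, something like fromDocument (lowerNames doc) could be queried with the ordinary element and attributeIs using lowercase names.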