Get content of Cursor from unnormalized xml


Suppose there is an XML file:

            <span id="assignee-val">

        <span class="user-hover" id="issue_summary_assignee_m" rel="m">
        <span class="aui-avatar aui-avatar-small"><div class="aui-avatar-inner"><img src="/secure/useravatar?size=small&amp;avatarId=10222" /></div></span>
        This Value!
    </span>
</span>

The question is how to get "This Value!" out of this xml.

This is what I've got :(

> :m + Control.Applicative Data.ByteString.Lazy Text.HTML.DOM Text.XML.Cursor
> Prelude.map content . (element "span" >=> "id" `attributeIs` "assignee-val" >=> child >=> element "span" >=> "class" `attributeIs` "user-hover" >=> child) . fromDocument . parseLBS <$> Data.ByteString.Lazy.readFile "xmlfile" 
[["\n            "],[],["\n            This Value!\n        "]]
  1. Why are there 3 answers? What query selects the content inside the <span class="user-hover"> tag more precisely?
  2. How to remove space indentations and newline symbols automatically?

UPD: in other words, the question is how to drop all nested tags (it doesn't matter how many there will be) and get first level content only, which is "This Value!" (and spaces and newlines).


There are 2 answers

jamshidh On BEST ANSWER

Question 1- Why are there 3 answers?

The data you have navigated to holds the children of the "user-hover" span tag. With the unimportant parts removed, your node looks like this:

<span class="user-hover">
    <span />
    This Value!
</span>

An XML parser sees this as

<span class="user-hover">[TextNode "\n    "]<span />[TextNode "\n    This Value!\n"]</span>

So, the "user-hover" element does in fact have 3 children.

[TextNode "\n    ", <span />, TextNode "\n    This Value!\n"]

You then apply "content" to each of these values. "content" returns the text of a text node and an empty list for any other kind of node, so the span element yields [] (not "") whether or not it has text nested inside it, and you get:

[["\n    "], [], ["\n    This Value!\n"]]
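A minimal sketch of the same experiment on the simplified node, for anyone who wants to reproduce it. It assumes the xml-conduit package and uses its strict parser (Text.XML.parseText_) rather than Text.HTML.DOM from the question; the names sample and perChild are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor (child, content, fromDocument)

-- The simplified node from above, as a standalone document (illustrative input).
sample :: TL.Text
sample = "<span class=\"user-hover\">\n    <span/>\n    This Value!\n</span>"

-- Apply content to every direct child of the root element.
-- Text nodes give a one-element list, element nodes give an empty list.
perChild :: [[T.Text]]
perChild = map content (child (fromDocument (parseText_ def sample)))

main :: IO ()
main = print perChild
```

The result should have one entry per child, with the empty list in the position of the inner span.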

Question 2- How do you remove space indentations and newline symbols automatically?

The XML spec requires a parser to pass whitespace in content through to the application, so it is preserved. There might be tools in the XML cursor lib to strip it for you (some XML processing libraries offer optional post-processing that strips whitespace), but I am unaware of any. Just strip the whitespace in another call after the query.

You can use the Data.Text.strip function to do the whitespace stripping for you.
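As a rough sketch of that post-processing step, using only the text package: strip every piece the query returned and drop the ones that were whitespace only. The helper name cleanPieces is hypothetical, and the input list just mimics the shape of the text nodes from the question.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Strip leading and trailing whitespace from each piece,
-- then discard pieces that are empty after stripping.
cleanPieces :: [T.Text] -> [T.Text]
cleanPieces = filter (not . T.null) . map T.strip

main :: IO ()
main = print (cleanPieces ["\n            ", "\n            This Value!\n        "])
-- prints ["This Value!"]
```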


To get the value you want, you need more information in the query. Will the data always be in the third position of the "user-hover" span element? Will it always be after a <span class="aui-avatar aui-avatar-small" /> element? Will it be all the content in the user-hover element concatenated with spaces stripped? Once you answer this, the solution should be obvious.


Updated answer:

With the extra info you supplied, I can add more info to the answer.

The short answer: remove the "Prelude.map content", add a ">=> content" to the end of the pipeline, and then apply Data.Text.concat to the final output.

Here are the details of why.

Almost all the functions in Text.XML.Cursor are of the form Cursor -> [Cursor] (the library calls this an Axis), and content has the similar form Cursor -> [Text]. The idea is to apply each filter to a list of nodes, then concat the results. This very closely resembles what happens in XPath, and was clearly modeled after that.

The nice thing is, the pattern I just described is exactly how the list monad works. If you chain together a bunch of a->[b] functions using Kleisli composition (>=>), which is built on bind (>>=), each stage of the pipeline effectively does a concat . map f. When you put the map content in front of the pipeline, it worked, but it did only half of the job the library intends for a full XPath-like tool: it pulled out the text content but never concatenated the results. Placed at the end of the pipeline, content returns a flat list of the text of each text node that is a direct child of the element. You still need one last concat to join those text items together.
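To make the "concat . map f at each stage" point concrete, here is a small sketch with plain Ints instead of cursors. The functions neighbours and evens are toy stand-ins for axes, not part of any library.

```haskell
import Control.Monad ((>=>))

-- Toy "axis": every number has two neighbours.
neighbours :: Int -> [Int]
neighbours n = [n - 1, n + 1]

-- Toy "filter": keep the number only if it is even.
evens :: Int -> [Int]
evens n = [n | even n]

-- Kleisli composition in the list monad.
pipeline :: Int -> [Int]
pipeline = neighbours >=> neighbours >=> evens

main :: IO ()
main = do
  print (pipeline 4)
  -- The same computation spelled out as nested concatMap calls.
  print (concatMap evens (concatMap neighbours (neighbours 4)))
-- both lines print [2,4,4,6]
```

Each stage fans out to a list and the results are flattened before the next stage runs, which is why the cursor pipeline returns one flat list rather than a list of lists.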

When I used the pipeline:

Data.Text.concat . (child >=> element "span" >=> "id" `attributeIs` "assignee-val" >=> child >=> element "span" >=> "class" `attributeIs` "user-hover" >=> child >=> content) . fromDocument . parseLBS <$> Data.ByteString.Lazy.readFile "file.xml" 

I got the result

"\n        \n        This Value!\n    "

You can still strip the final result with Data.Text.strip if you want to.
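Putting the pieces together, a self-contained sketch might look like the following. It is written under a few assumptions: it uses xml-conduit's strict parser (Text.XML.parseText_) on an inline string instead of Text.HTML.DOM on a file, the document is a trimmed-down version of the one in the question, the root element is the "assignee-val" span (so there is no leading child step), and directText is a made-up helper name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor (Cursor, attributeIs, child, content, element, fromDocument)

-- Trimmed-down stand-in for the document in the question.
sample :: TL.Text
sample = TL.concat
  [ "<span id=\"assignee-val\">\n"
  , "  <span class=\"user-hover\" rel=\"m\">\n"
  , "    <span class=\"aui-avatar\"><div><img src=\"a.png\"/></div></span>\n"
  , "    This Value!\n"
  , "  </span>\n"
  , "</span>"
  ]

-- Collect the text nodes directly under the user-hover span,
-- join them, and strip the surrounding whitespace.
directText :: Cursor -> T.Text
directText root = T.strip (T.concat (texts root))
  where
    texts = element "span" >=> attributeIs "id" "assignee-val"
        >=> child >=> element "span" >=> attributeIs "class" "user-hover"
        >=> child >=> content

main :: IO ()
main = print (directText (fromDocument (parseText_ def sample)))
-- prints "This Value!"
```

Because content only looks at the node it is given, text nested deeper inside the avatar span would not be picked up, no matter how many nested tags there are.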

Venge On

The reason there are multiple answers is that the user-hover span has multiple children: the child before the aui-avatar span (which just contains whitespace), the aui-avatar span, and the one containing "This Value!". To get the very last value, you should just look at the last element of your result set as opposed to rewriting your query:

λ> import Control.Applicative
λ> import qualified Data.ByteString.Lazy as L
λ> import qualified Data.Text as T
λ> import Text.HTML.DOM
λ> import Text.XML.Cursor
λ> :set -XOverloadedStrings
λ> let assignee = element "span" >=> "id" `attributeIs` "assignee-val"
λ> let hover = element "span" >=> "class" `attributeIs` "user-hover"
λ> map T.strip . content . last . (assignee >=> child >=> hover >=> child) . fromDocument . parseLBS <$> L.readFile "xmlfile"
["This Value!"]