XQuery FLWOR: How would I calculate the frequency of words that show up


I am searching an XML file for words that come directly after the word "has", and I want to count how often each of those words occurs. So far I can list every word that follows "has", but the list contains duplicates. How can I group the successor words and count each one?

I am using XQuery 1.0.

Snippet of the XML file:


<s n="2">

<w pos="CONJ" hw="that" c5="CJT-DT0">That </w>

<w pos="PRON" hw="you" c5="PNP">you</w>

<w pos="VERB" hw="be" c5="VBB">'re </w>

<w pos="VERB" hw="greet" c5="VVN">greeted </w>

<w pos="PREP" hw="in" c5="PRP">in </w>

<w pos="ART" hw="the" c5="AT0">the </w>

<w pos="ADJ" hw="first" c5="ORD">first </w>

<w pos="SUBST" hw="place" c5="NN1-VVB">place </w>

<w pos="PREP" hw="with" c5="PRP">with </w>

<w pos="UNC" hw="erm" c5="UNC">erm </w>

<w pos="ADV" hw="either" c5="AV0">either </w>

<w pos="SUBST" hw="silence" c5="NN1-VVB">silence </w>

<w pos="CONJ" hw="or" c5="CJC">or </w>

<w pos="ADJ" hw="some" c5="DT0">some </w>

<w pos="ADJ" hw="vague" c5="AJ0">vague </w>

<w pos="CONJ" hw="and" c5="CJC">and </w>

<w pos="ADV" hw="not" c5="XX0">not </w>

<w pos="ADV" hw="singularly" c5="AV0">singularly </w>

<w pos="ADJ" hw="hopeful" c5="AJ0">hopeful </w>

<w pos="SUBST" hw="mutter" c5="NN1-VVB">mutter</w>

<c c5="PUN">, </c>

<w pos="CONJ" hw="but" c5="CJC">but </w>

<w pos="ADV" hw="more" c5="AV0">more </w>

<w pos="ADV" hw="importantly" c5="AV0">importantly </w>

<w pos="PREP" hw="with" c5="PRP">with </w>

<w pos="ART" hw="a" c5="AT0">a </w>

<w pos="ADJ" hw="curious" c5="AJ0">curious </w>

<w pos="SUBST" hw="facial" c5="NN1-AJ0">facial </w>

<w pos="SUBST" hw="expression" c5="NN1">expression </w>

<w pos="VERB" hw="mingle" c5="VVD-VVN">mingled </w>

<w pos="PREP" hw="between" c5="PRP">between </w>

<w pos="UNC" hw="erm" c5="UNC">erm </w>

<w pos="SUBST" hw="dread" c5="NN1">dread </w>

<w pos="CONJ" hw="and" c5="CJC">and </w>

<w pos="SUBST" hw="contempt" c5="NN1">contempt</w>

<c c5="PUN">, </c>

<w pos="SUBST" hw="sort" c5="NN1">sort </w>

<w pos="PREP" hw="of" c5="PRF">of </w>

<w pos="SUBST" hw="thing" c5="NN1">thing </w>

<w pos="PRON" hw="you" c5="PNP">you</w>

<w pos="VERB" hw="would" c5="VM0">'d </w>

<w pos="VERB" hw="expect" c5="VVI">expect </w>


<mw c5="CJS">

<w pos="PREP" hw="as" c5="PRP">as </w>

<w pos="CONJ" hw="if" c5="CJS">if </w>

</mw>

<w pos="PRON" hw="you" c5="PNP">you</w>

<w pos="VERB" hw="have" c5="VHD">'d </w>

<w pos="VERB" hw="say" c5="VVN">said </w>

<w pos="PRON" hw="you" c5="PNP">you </w>

<w pos="VERB" hw="be" c5="VBD">were </w>

<w pos="ART" hw="a" c5="AT0">a </w>

<w pos="SUBST" hw="sorcerer" c5="NN1">sorcerer</w>

<c c5="PUN">.</c>

</s>


<s n="3">

<vocal desc="laugh"/>

<w pos="PRON" hw="i" c5="PNP">I </w>

<w pos="VERB" hw="find" c5="VVB">find </w>

<w pos="PRON" hw="myself" c5="PNX">myself </w>

<w pos="ART" hw="the" c5="AT0">the </w>

<w pos="ADJ" hw="only" c5="AJ0">only </w>

<w pos="SUBST" hw="thing" c5="NN1">thing </w>

<w pos="VERB" hw="be" c5="VBZ">is </w>

<w pos="PREP" hw="to" c5="TO0">to </w>

<w pos="VERB" hw="change" c5="VVI">change </w>

<w pos="ART" hw="the" c5="AT0">the </w>

<w pos="SUBST" hw="subject" c5="NN1">subject</w>

<c c5="PUN">.</c>

</s>


<s n="4">

<w pos="ADJ" hw="this" c5="DT0">This </w>

<w pos="UNC" hw="erm" c5="UNC">erm </w>

<w pos="SUBST" hw="reaction" c5="NN1">reaction </w>

<w pos="PREP" hw="to" c5="PRP">to </w>

<w pos="ART" hw="the" c5="AT0">the </w>

<w pos="SUBST" hw="disclosure" c5="NN1">disclosure </w>

<w pos="PRON" hw="i" c5="PNP">I </w>

<w pos="VERB" hw="think" c5="VVB">think</w>

<w pos="VERB" hw="be" c5="VBZ">'s </w>

<w pos="ADJ" hw="exaggerated" c5="AJ0-VVN">exaggerated </w>

<w pos="CONJ" hw="but" c5="CJC">but </w>

<w pos="PREP" hw="on" c5="PRP">on </w>

<w pos="ART" hw="the" c5="AT0">the </w>

<w pos="ADJ" hw="other" c5="AJ0">other </w>

<w pos="SUBST" hw="hand" c5="NN1">hand </w>

<w pos="PRON" hw="there" c5="EX0">there</w>

<w pos="VERB" hw="be" c5="VBZ">'s </w>

<w pos="PRON" hw="something" c5="PNI">something </w>

<w pos="PREP" hw="in" c5="PRP">in </w>

<w pos="PRON" hw="it" c5="PNP">it</w>

<c c5="PUN">.</c>

</s>

My current code for getting all the words after the target word 'has':

<html>
<body>
<table border='1'>
<tr><td>Target</td><td>Successor</td></tr>

{
for $targetword in (collection("./?select=*xml"))//s//w
where lower-case(normalize-space($targetword))="has"
let $successor := lower-case(normalize-space($targetword/following-sibling::w[1]))
return <tr><td>{data($targetword)}</td><td>{$successor}</td></tr>
}
</table>
</body>
</html>

Any help would be appreciated.

Answer by Yitzhak Khabinsky

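I am using BaseX.

You need to add a group by clause to the FLWOR expression. Note that group by was introduced in XQuery 3.0 and is not part of XQuery 1.0. After grouping, $targetword is bound to the sequence of all matching words in that group, so count($targetword) gives the frequency of each successor. The query below matches "you" rather than "has", as the posted snippet contains no occurrence of "has".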

XQuery

xquery version "3.0";

<html>
<body>
<table border='1'>
<thead>
  <tr><th>Target</th><th>Successor</th><th>Rank</th></tr>
</thead>
<tbody>
{
  for $targetword in doc("e:\Temp\Hassan_Grouping.xml")//w
  where lower-case(normalize-space($targetword))="you"
  let $successor := lower-case(normalize-space($targetword/following-sibling::w[1]))
  group by $successor
  return <tr>
      <td>{data($targetword[1])}</td>
      <td>{$successor}</td>
      <td>{count($targetword)}</td>
    </tr>
}
</tbody></table>
</body>
</html>

Output

<html>
  <body>
    <table border="1">
      <thead>
        <tr>
          <th>Target</th>
          <th>Successor</th>
          <th>Rank</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>you</td>
          <td>'re</td>
          <td>1</td>
        </tr>
        <tr>
          <td>you</td>
          <td>'d</td>
          <td>2</td>
        </tr>
        <tr>
          <td>you</td>
          <td>were</td>
          <td>1</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
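
Since the question mentions XQuery 1.0, where group by is not available, the same counts can usually be obtained with distinct-values. The following is only a minimal sketch under stated assumptions: the inline $doc element is a made-up sample standing in for the real corpus file, $target is a hypothetical variable name, and the exists($next) check is an addition that skips target words with no following w sibling.

```xquery
xquery version "1.0";

(: Hypothetical inline sample standing in for the corpus file. :)
declare variable $doc :=
  <s n="1">
    <w>you</w><w>'d </w><w>said </w><w>you </w><w>were </w><w>you</w><w>'d </w>
  </s>;

(: Word to search for; "you" is used because this sample has no "has". :)
declare variable $target := "you";

<table border="1">
  <tr><th>Target</th><th>Successor</th><th>Count</th></tr>
  {
    (: Collect the normalized successor of every occurrence of the target word. :)
    let $successors := (
      for $w in $doc//w
      let $next := $w/following-sibling::w[1]
      where lower-case(normalize-space($w)) = $target and exists($next)
      return lower-case(normalize-space($next))
    )
    (: XQuery 1.0 grouping idiom: loop over the distinct values and count the matches. :)
    for $s in distinct-values($successors)
    let $n := count($successors[. = $s])
    order by $n descending, $s
    return <tr><td>{$target}</td><td>{$s}</td><td>{$n}</td></tr>
  }
</table>
```

With this sample the table should contain one row for 'd with a count of 2, followed by one row for were with a count of 1. To run it against real data, replace $doc with the doc(...) or collection(...) call from the queries above. The extra pass over $successors for each distinct value is slower than group by on large corpora, so the XQuery 3.0 version is preferable when the processor supports it.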