How do I get all hits from a cts:search() in MarkLogic?


I have a collection containing lots of documents.

When I search the collection, I need to get a list of matches independent of documents. If I search for the word "pie", I get back a list of documents, properly sorted by relevance. However, some of these documents contain the word "pie" in more than one place. I would like to get back a list of all matches, unrelated to the document where each match was found. This list of all hits would also need to be sorted by relevance (weight), again totally independent of the document (not grouped by document).

The following code searches and returns matches grouped by document:

let $searchfor := "pie"

let $query := cts:and-query((
  cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
  cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))

let $resultset := cts:search(fn:collection("docs"), $query)[1 to 100]
for $n in $resultset
  return cts:score($n)

What I need is $n to be the "match-node", not a "document-node"...

Thanks!


There are 4 answers

Clark Richey

I recommend that you look at the Search API (http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf and http://community.marklogic.com/pubs/5.0/apidocs/SearchAPI.html). This API will give you what you want: it returns match nodes as well as the URIs of the documents that contain them. You should also find it easier to use for the general cases, although there will be edge cases where you need to fall back to cts:search.

search:search is the specific function you will want to use. It will give you back responses similar to this:

    <search:response total="1" start="1" page-length="10" xmlns=""
        xmlns:search="http://marklogic.com/appservices/search">
      <search:result index="1" uri="/foo.xml"
          path="fn:doc(&quot;/foo.xml&quot;)" score="328"
          confidence="0.807121" fitness="0.901397">
        <search:snippet>
          <search:match path="fn:doc(&quot;/foo.xml&quot;)/foo">
            <search:highlight>hello</search:highlight>
          </search:match>
        </search:snippet>
      </search:result>
      <search:qtext>hello sample-property-constraint:boo</search:qtext>
      <search:report id="SEARCH-FLWOR">(cts:search(fn:collection(),
          cts:and-query((cts:word-query("hello", ("lang=en"), 1),
          cts:properties-query(cts:word-query("boo", ("lang=en"), 1))),
          ()), ("score-logtfidf"), 1))[1 to 10]
      </search:report>
      <search:metrics>
        <search:query-resolution-time>PT0.647S</search:query-resolution-time>
        <search:facet-resolution-time>PT0S</search:facet-resolution-time>
        <search:snippet-resolution-time>PT0.002S</search:snippet-resolution-time>
        <search:total-time>PT0.651S</search:total-time>
      </search:metrics>
    </search:response>

Here you can see that every result contains one or more search:match elements.
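As a rough sketch of how the match elements could be pulled out of such a response, the query below calls search:search for the asker's term and returns every search:match across all results. The collection name "docs" and the page length of 100 are taken from the question; restricting the search with an additional-query option is an assumption about how you would want to scope it, not something the answer above prescribes.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

(: Assumption: limit the search to the "docs" collection from the question. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <additional-query>{cts:collection-query("docs")}</additional-query>
  </options>

(: First 100 results, ordered by document relevance. :)
let $response := search:search("pie", $options, 1, 100)

(: Flatten: one item per match, regardless of which result it came from. :)
for $match in $response/search:result/search:snippet/search:match
return $match
```

Each search:match carries a path attribute pointing at the matching node, so the list is flat, but its order still follows the relevance of the containing documents.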

wst

Document relevance is determined by TF-IDF. Matches contribute to a document's score, but individual matches have no scores relative to each other. cts:search already returns results ordered by document relevance, so you could do this to get match nodes ordered by their ancestor document's score:

let $searchfor := "pie"
let $query := cts:and-query((
  cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
  cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
return
  (: positions are 1-based; $cts:node is a text node, so select the match wrapper itself :)
  cts:search(//(title|para), $query)[1 to 100]
    /cts:highlight(., $query, element match {$cts:node})//match

kstirman

How would you determine the relevance of a word independent of the document? Relevance is a measure of document relevance, not word relevance. I don't know how one would measure word relevance.

You could potentially return all words ordered by document relevance, and within each document in "document order", meaning the order in which they appear in the document. That would be relatively easy to do with search:search: iterate over all results and extract each matching word. What would you present with each match? Its surrounding snippet?

Keep in mind that what you're asking for would potentially take a long time to execute.
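The ordering described above (documents by relevance, then hits in document order) could be sketched with cts:search and cts:highlight directly. This is only an illustration: the hit and match element names are made up for the example, and it uses cts:or-query instead of the question's cts:and-query on the assumption that a hit in either title or para should count.

```xquery
xquery version "1.0-ml";

let $searchfor := "pie"
(: Assumption: match either element; the question used cts:and-query. :)
let $query := cts:or-query((
  cts:element-word-query(xs:QName("title"), $searchfor, (), 16),
  cts:element-word-query(xs:QName("para"), $searchfor, (), 10)
))

(: Outer loop: documents in relevance order. :)
for $doc in cts:search(fn:collection("docs"), $query)[1 to 100]
let $score := cts:score($doc)
(: Wrap every matched word in a (hypothetical) hit element. :)
let $marked := cts:highlight($doc, $query, <hit>{$cts:text}</hit>)
(: Inner loop: hits in document order. :)
for $hit in $marked//hit
return
  <match uri="{xdmp:node-uri($doc)}" score="{$score}">{fn:string($hit)}</match>
```

Every match from the same document carries the same score, which reflects the point made above: the score belongs to the document, not to the individual word. Highlighting 100 full documents per request is also where the execution cost would show up.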

ado

You need to split (fragment) the documents into smaller documents. Every text node could be its own document, stored with its original XPath so that the context is not lost.
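One way this splitting might look, as a minimal sketch rather than a tested migration: each title or para element (instead of each individual text node) is copied into its own small document that records the source URI and the original path. The part wrapper, its attribute names, the URI scheme, and the "doc-parts" collection are all hypothetical names chosen for the example.

```xquery
xquery version "1.0-ml";

for $doc in fn:collection("docs")
for $node at $i in $doc//(title|para)
return
  xdmp:document-insert(
    (: Hypothetical URI scheme: source URI plus a running part number. :)
    fn:concat(xdmp:node-uri($doc), "-part-", $i, ".xml"),
    (: Keep the source URI and original path so the context is not lost. :)
    <part source-uri="{xdmp:node-uri($doc)}"
          source-path="{xdmp:path($node)}">{$node}</part>,
    xdmp:default-permissions(),
    "doc-parts")
```

A cts:search over fn:collection("doc-parts") would then score each part separately, which gives per-match relevance at the cost of storing the content twice.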