XQuery fuzzy search in exist-db

We use an eXist-db database to store various XML documents, which we search using XQuery. This is an example of such a document:

<person personID="some_id">
    <name>
        <familyName>Doe</familyName>
        <firstName>John</firstName>
    </name>
</person>

The search we are using is a fuzzy search, and the query has the following form:

xquery version "3.0";
for $doc in collection('/db/Persons')/*[ft:query(., 'milan~')]
let $score := ft:score($doc)
order by $score descending
return base-uri($doc)

The problem is that the search orders the results rather strangely. For example, it ranks Milun, Milun, Golan, Vilon before Milan. In other words, the search assigns a higher score to results that are not an exact match than to the exact match (Milan). What are we doing wrong? Is there a way to give exact matches higher scores than near-exact matches?

Joe Wicentowski (best answer)

eXist-db's full-text search index is built on top of Apache Lucene. This problem was reported in the Lucene bug tracker (see https://issues.apache.org/jira/browse/LUCENE-329) and in other products built on Lucene, such as Elasticsearch (see https://github.com/elastic/elasticsearch/issues/20369), and a fix was made in Lucene 5.3 with https://svn.apache.org/viewvc?view=revision&revision=1680548.

In order for eXist to benefit from this improvement, eXist would need to upgrade its Lucene libraries from the version in the current release of eXist, Lucene 4.10.4, to Lucene 5.3 or higher. Some API incompatibilities between Lucene 4.x and 5.x+ have so far prevented eXist from making this jump (see the open issue https://github.com/eXist-db/exist/issues/1160), but I believe the challenge isn't insurmountable.

In the meantime, as a workaround, you could add a second query that looks for an exact match and return its results either on their own or as the first hits, above the fuzzy matches. Depending on your application, you may need to strip the tilde out of user-supplied input, but this should accomplish your goal.
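Here is a minimal sketch of that workaround, assuming the same /db/Persons collection and Lucene index configuration as in the question. The variable names ($term, $clean, $exact, $fuzzy) are only illustrative, and I have not run this against your data, so treat it as a starting point rather than a finished query:

```xquery
xquery version "3.0";

(: Hypothetical search term; in a real application this would be user input. :)
let $term := 'milan~'
(: Strip any tilde so the first query is a plain, non-fuzzy match. :)
let $clean := translate($term, '~', '')
let $persons := collection('/db/Persons')/*

(: Exact matches, ordered by their own Lucene score. :)
let $exact :=
    for $doc in $persons[ft:query(., $clean)]
    let $score := ft:score($doc)
    order by $score descending
    return $doc

(: Fuzzy matches, skipping documents already returned as exact matches. :)
let $fuzzy :=
    for $doc in $persons[ft:query(., concat($clean, '~'))]
    let $score := ft:score($doc)
    where empty($doc intersect $exact)
    order by $score descending
    return $doc

(: Exact hits come first, followed by the remaining fuzzy hits. :)
for $doc in ($exact, $fuzzy)
return base-uri($doc)
```

Note that this only removes the tilde; if your users can type other characters that are special in Lucene's query syntax, you would need to handle those separately.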