Whoosh fuzzy matching of the queried word list


By fuzzy match here, I mean finding the documents that contain roughly 60-70% of the words in the query's word list.

E.g.:

>> from whoosh.qparser import QueryParser
>> # Query string as passed by the user
>> query = "i am searching for a document that is matched fuzzily with what i am giving here."
>> QueryParser("content", ix.schema).parse(query)

This query looks for documents that contain all of the words, but I want every document that contains at least 60% of them.

The number of words I am dealing with is large, and I do not want to programmatically partition the word set into subsets to OR together.


There are 1 answers

Assem On

This does not seem to be implemented in Whoosh yet (checked 28/05/2015).

However, the documentation for whoosh.query.Or mentions a minmatch argument:

class whoosh.query.Or (subqueries, boost=1.0, minmatch=0, scale=None)

Parameters:

  • subqueries – a list of Query objects to search for.

  • boost – a boost factor to apply to the scores of all matching documents.

  • minmatch – not yet implemented.

  • scale – a scaling factor for a “coordination bonus”. If this value is not None, it should be a floating point number greater than 0 and less than 1. The scores of the matching documents are boosted/penalized based on the number of query terms that matched in the document. This number scales the effect of the bonuses.

Assuming minmatch is meant to be the minimum number of query terms that must match, the solution would look like this:

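from math import ceil
from whoosh.query import Or, Term

raw_query = "i am searching for a document that is matched fuzzily with what i am giving here."
words = raw_query.split()
# Minimum number of matching words: 60% of the word count (not the character count).
min_match = int(ceil(len(words) * 3.0 / 5.0))
# minmatch is documented as not yet implemented, so this only shows the intended usage.
query = Or([Term("content", word) for word in words], minmatch=min_match)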

In this case, either disable stopword filtering or remove the stopwords from the query before counting its words; otherwise the threshold is computed from words that are never searched for.
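To make the stopword point concrete, here is a minimal pure-Python sketch (no Whoosh involved) that removes stopwords before computing the 60% threshold, then checks a candidate text against it. The STOPWORDS set and the helper names query_terms, min_matches and matches_enough are made up for illustration; Whoosh's analyzers use their own stop list, so treat this as an assumption-laden sketch rather than what Whoosh does internally.

```python
from math import ceil

# Hypothetical stop list for illustration only; not Whoosh's own list.
STOPWORDS = {"i", "am", "for", "a", "that", "is", "with", "what"}

def query_terms(raw_query, stopwords=STOPWORDS):
    """Lowercase the query, strip punctuation, drop stopwords and duplicates."""
    terms = []
    for word in raw_query.lower().split():
        word = word.strip(".,;:!?")
        if word and word not in stopwords and word not in terms:
            terms.append(word)
    return terms

def min_matches(terms, ratio=0.6):
    """Smallest number of terms that reaches the given ratio."""
    return int(ceil(len(terms) * ratio))

def matches_enough(document_text, terms, ratio=0.6):
    """True if the text contains at least `ratio` of the query terms."""
    doc_words = {w.strip(".,;:!?") for w in document_text.lower().split()}
    hits = sum(1 for term in terms if term in doc_words)
    return hits >= min_matches(terms, ratio)

raw_query = "i am searching for a document that is matched fuzzily with what i am giving here."
terms = query_terms(raw_query)
threshold = min_matches(terms)
print(terms, threshold)
```

A check like matches_enough could be applied as a post-filter on the hits of a plain Or query until minmatch is usable, at the cost of re-reading each hit's stored text.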