Similarity search using Solr's NGramFilterFactory


I am trying to use the NGramFilterFactory in Solr (via Sunspot in Rails) to find similar titles. I managed to add a new field type to my Solr schema.xml as follows:

<fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
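
For illustration, the effect of this field type can be approximated with a rough Python sketch. It is not Solr's implementation: the helper names (`ngrams`, `index_terms`, `query_terms`) are made up for this example, and the order of the grams is not meant to mirror Lucene's. It only shows which terms end up in the index (all 2- to 4-character grams of each lowercased token) versus which terms the query analyzer produces (whole lowercased tokens).

```python
def ngrams(token, min_size=2, max_size=4):
    """Return all character n-grams of token with min_size <= length <= max_size."""
    token = token.lower()
    grams = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def index_terms(title):
    """Approximate the index analyzer: whitespace split, lowercase, n-gram."""
    terms = set()
    for token in title.split():
        terms.update(ngrams(token))
    return terms


def query_terms(query):
    """Approximate the query analyzer: whitespace split and lowercase only."""
    return {token.lower() for token in query.split()}


indexed = index_terms("Batman Returns")
print(sorted(query_terms("bat") & indexed))     # ['bat']
print(sorted(query_terms("batman") & indexed))  # []
```

Because the query side is not n-grammed, a query word longer than maxGramSize (here "batman", six characters) has no matching term in the index, which is worth keeping in mind when choosing the gram sizes.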

Since I am using Sunspot in a Rails app, I added the new field to Sunspot through a dynamic field. This all worked, and I can now search my model using the NGramFilterFactory. What I am not sure about is how to configure Solr to search for similar titles. Here are my concrete questions:

  1. Does it make sense to use the dismax query parser when I am trying to query similar titles?
  2. How can the Minimum 'Should' Match (mm) parameter help me find similar titles?
  3. On what basis should I choose the minimum and maximum n-gram sizes?

Thanks for any feedback.


There is 1 answer

polmiro On

There are several things you could do:

  1. dismax does not support fuzzy search. If you want to return 'holmes' when the user searches for 'homes' or 'halmes', it would be best to switch to the edismax parser, which accepts Lucene's fuzzy operator (for example homes~).
  2. Minimum 'Should' Match (mm) lets you define how strict matching is, depending on how many of the query's words must match. Suppose a user searches for 'Batman Dark Night' and you have the records 'Batman Darker Night' and 'Batman Returns' tokenized. If mm is 2, only 'Batman Darker Night' will be returned, because it matches the minimum number of words: 'Batman' and 'Night'. 'Batman Returns', on the other hand, matches only one of them, so it won't be returned.
  3. NGramFilterFactory is mainly good for autocomplete. I think PorterStemFilterFactory is a better fit for what you are looking for. You can find some info here: http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.PorterStemFilterFactory
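
The mm example in point 2 can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions: it compares whole lowercased words, ignores scoring and Solr's analysis chain, and treats mm as a plain integer (Solr's mm also accepts percentages and conditional expressions). The function names `matching_words` and `search` are hypothetical and exist only for this illustration.

```python
def matching_words(query, title):
    """Count how many distinct query words also appear in the title."""
    query_words = {word.lower() for word in query.split()}
    title_words = {word.lower() for word in title.split()}
    return len(query_words & title_words)


def search(query, titles, mm):
    """Keep only titles that match at least mm of the query words."""
    return [title for title in titles if matching_words(query, title) >= mm]


titles = ["Batman Darker Night", "Batman Returns"]

print(search("Batman Dark Night", titles, mm=2))  # ['Batman Darker Night']
print(search("Batman Dark Night", titles, mm=1))  # ['Batman Darker Night', 'Batman Returns']
```

Lowering mm widens the result set and raising it narrows it, so for similar-title search it acts as a dial between recall and precision.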