Custom ShingleFilter in Solr


I need a token filter that produces the following tokens:

Text - "Quick brown fox jump"
Tokens:
"Quick"
"Quick brown"
"Quick brown fox"
"Quick brown fox jump"

If I use ShingleFilter, I also get extra tokens such as "brown fox" and "fox jump", which I don't want. Is there a ready-made way to achieve this? Any help would be highly appreciated.
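To make the difference concrete, here is a minimal plain-Python sketch (not Solr or Lucene code) that contrasts every contiguous word n-gram with the n-grams that start at the first word. The function names `word_shingles` and `leading_shingles` are made up for this illustration, and splitting on whitespace is an assumption about how the text is tokenized.

```python
def word_shingles(text, max_size):
    """All contiguous word n-grams of 1..max_size words (illustrative only)."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for i in range(len(words))
        for n in range(1, max_size + 1)
        if i + n <= len(words)
    ]


def leading_shingles(text):
    """Only the word n-grams that start at the first word."""
    words = text.split()
    return [" ".join(words[:n]) for n in range(1, len(words) + 1)]


text = "Quick brown fox jump"
wanted = leading_shingles(text)
# Everything a full shingling pass would add on top of the wanted tokens.
unwanted = [s for s in word_shingles(text, 4) if s not in wanted]

print(wanted)
print(unwanted)
```

The second list contains "brown fox" and "fox jump", the tokens that should not be produced.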

There is 1 answer

root On

Basically you want a prefix search. Try EdgeNGramFilterFactory.

This filter factory emits the leading substrings (prefixes) of each token, which makes it useful for prefix matching.

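<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <!-- Index time: split on non-letters, lowercase, then emit prefixes of 2 to 15 characters per token -->
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
   </analyzer>
   <!-- Query time: no n-grams, so the typed prefix is matched against the indexed prefixes -->
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>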

Note: minGramSize and maxGramSize set the range of gram lengths, so here the shortest gram is 2 characters and the longest is 15. A token shorter than 2 characters produces no grams and is dropped. For a token longer than 15 characters, only its prefixes of up to 15 characters are indexed.

For example, the string "a" produces no tokens, because it is shorter than the minimum gram size of 2. The maximum gram size likewise caps how long an indexed prefix can be, so adjust both values to your needs.
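The effect of the two size settings can be sketched in a few lines of plain Python. This is only an approximation of the filter's behaviour for a single token, not Lucene's implementation; the function name `edge_ngrams` is hypothetical and the sketch ignores any other filter options.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Leading substrings of `token` with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    # range() is empty when the token is shorter than min_gram,
    # so such tokens yield no grams at all.
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("a"))
print(edge_ngrams("quick"))
print(edge_ngrams("abcdefghijklmnopqrst")[-1])
```

With the sizes from the config above, "a" yields nothing, "quick" yields its prefixes from two characters upward, and a 20-character token yields prefixes no longer than 15 characters.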

Also note that using EdgeNGram increases your index size, because more tokens are generated for the same string, so take that into account as well.
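As a rough way to gauge that growth, the following Python sketch counts how many grams each token would contribute with minGramSize 2 and maxGramSize 15. The helper `gram_count` is hypothetical, and the number of emitted terms is only a proxy for index size, which also depends on how many of those terms are distinct across documents.

```python
def gram_count(token_length, min_gram=2, max_gram=15):
    """Number of edge n-grams emitted for a token of the given length."""
    return max(0, min(token_length, max_gram) - min_gram + 1)


words = ["quick", "brown", "fox", "jump"]
plain_terms = len(words)                             # one term per word
edge_terms = sum(gram_count(len(w)) for w in words)  # one term per prefix

print(plain_terms, edge_terms)
```

Raising minGramSize or lowering maxGramSize reduces the count, at the cost of no longer matching very short or very long prefixes.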