Solr tokenizer filter substring


Is there a way to index a field so that each leading run of words (each substring that starts at the first word and ends on a word boundary) is treated as a separate token?

For example, input: "hello world, how are you?"

output: "hello world how are you", "hello world how are", "hello world how", "hello world", "hello"

This would be used in combination with SuggestComponent to provide autosuggestions for users.


There is 1 answer

0
Mysterion On

In principle, something like solr.ShingleFilterFactory could do the trick for you. Its main parameters are minShingleSize and maxShingleSize, and it emits word n-grams (shingles) in that size range starting at every word position, not only at the first word. So it will generate a lot of tokens, many of which will not be useful to you, and that also means wasted space on disk.

So you would need either to filter out the unneeded tokens or to write your own filter.
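
To make the difference concrete, here is a rough sketch in plain Python (not Solr or Lucene code) that compares what a shingle-style filter would emit with the tokens the question asks for. The function names `words`, `shingles` and `word_prefixes` are made up for illustration, and the tokenization is a simple assumption (lowercase, split on non-word characters). Note also that the sketch lets the size range start at 1, whereas in Solr single words are controlled by `outputUnigrams` rather than by `minShingleSize`.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, min_size, max_size):
    """Every run of min_size..max_size adjacent words, starting at any position."""
    return [
        " ".join(tokens[start:start + size])
        for start in range(len(tokens))
        for size in range(min_size, max_size + 1)
        if start + size <= len(tokens)
    ]


def word_prefixes(tokens):
    """Only the runs that start at the first word, longest first."""
    return [" ".join(tokens[:end]) for end in range(len(tokens), 0, -1)]


tokens = words("hello world, how are you?")
all_shingles = shingles(tokens, 1, 5)
wanted = word_prefixes(tokens)

print(len(all_shingles), len(wanted))  # 15 shingles versus 5 wanted tokens
print(wanted)
```

For this five-word input only a third of the shingles are the tokens the question wants, and the gap grows with longer text. A custom filter would presumably do something like `word_prefixes` at index time, keeping only the shingles that begin at the first position.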