Hibernate Search with Lucene matches configuration

I'm completely new to Hibernate Search, and I'm dealing with a reported bug: a search for the term 0001 matches the value 007a7358924e4a60923c6a57f58333bf. The field in question is the following:

@FullTextField(analyzer = "edgeNgram")
@Column(name = "serial")
private String serial;

The edgeNgram analyzer is declared as:

@Override
public void configure(final LuceneAnalysisConfigurationContext context) {
  context.analyzer("edgeNgram").custom()
      .tokenizer(WhitespaceTokenizerFactory.class)
      .charFilter(HTMLStripCharFilterFactory.class)
      .tokenFilter(ASCIIFoldingFilterFactory.class)
      .tokenFilter(LowerCaseFilterFactory.class)
      .tokenFilter(SnowballPorterFilterFactory.class)
      .tokenFilter(EdgeNGramFilterFactory.class)
      .param("minGramSize", "2")
      .param("maxGramSize", "32");
}

And the matching is done with:

private SearchPredicate matchField(SearchPredicateFactory f, String field, String search) {
  return f.match().field(field).matching(search).toPredicate();
}

I'm not sure this really is a bug, since I assume this is how the engine works and the point of full-text search is to return results that are not exact matches. But it was raised as a bug, and I'm looking for some way to make 0001 or 000 not match the string above.

I'm happy to include any other code you may find useful. I don't really know how to state this question more clearly.

There is 1 answer

Best answer, by mark_o

Try defining a separate analyzer for your search terms, one that does not include the edge n-gram filter:

@Override
public void configure(final LuceneAnalysisConfigurationContext context) {
  context.analyzer("edgeNgram").custom()
      .tokenizer(WhitespaceTokenizerFactory.class)
      .charFilter(HTMLStripCharFilterFactory.class)
      .tokenFilter(ASCIIFoldingFilterFactory.class)
      .tokenFilter(LowerCaseFilterFactory.class)
      .tokenFilter(SnowballPorterFilterFactory.class)
      .tokenFilter(EdgeNGramFilterFactory.class)
      .param("minGramSize", "2")
      .param("maxGramSize", "32");
  context.analyzer("searchAnalyzer").custom()
      .tokenizer(WhitespaceTokenizerFactory.class)
      // this one probably also doesn't make sense (unless your search query includes HTML...):
      //.charFilter(HTMLStripCharFilterFactory.class)
      .tokenFilter(ASCIIFoldingFilterFactory.class)
      .tokenFilter(LowerCaseFilterFactory.class)
      .tokenFilter(SnowballPorterFilterFactory.class);
}

and then in your entity:

@FullTextField(analyzer = "edgeNgram", searchAnalyzer = "searchAnalyzer")
@Column(name = "serial")
private String serial;

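What happens is that, with a single analyzer, the same analysis is applied to your search string "0001" at query time, so it is tokenized as [00, 000, 0001]. Since your document value 007a7358924e4a60923c6a57f58333bf starts with 00, it was indexed with the token 00 as well, and you get a match. With a search analyzer that skips the n-gram filter, the query stays a single token, 0001, which only matches values that actually start with 0001.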
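
To make the token overlap concrete, here is a small sketch in plain Java that imitates what an edge n-gram filter does to a single token. It does not use Lucene or Hibernate Search: the class EdgeNgramDemo and its methods edgeNgrams and matches are hypothetical helpers written for this illustration, and the "match if any query token equals any indexed token" rule is a simplifying assumption about how a match predicate behaves with default settings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: imitates edge n-gram expansion without Lucene.
public class EdgeNgramDemo {

    // Returns the prefixes of the token with lengths from minGram up to maxGram
    // (capped at the token length), as an edge n-gram filter would emit them.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Simplified assumption: a hit occurs when at least one query token
    // is equal to at least one indexed token.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        Set<String> indexed = new HashSet<>(indexedTokens);
        for (String queryToken : queryTokens) {
            if (indexed.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String serial = "007a7358924e4a60923c6a57f58333bf";
        List<String> indexed = edgeNgrams(serial, 2, 32);

        // Same analyzer at query time: the query is expanded into prefixes too.
        List<String> queryWithNgrams = edgeNgrams("0001", 2, 32);
        // Separate search analyzer without the n-gram filter: one token only.
        List<String> queryWithoutNgrams = List.of("0001");

        System.out.println("query tokens with n-grams: " + queryWithNgrams);
        System.out.println("matches with n-grams at query time: "
                + matches(indexed, queryWithNgrams));
        System.out.println("matches without n-grams at query time: "
                + matches(indexed, queryWithoutNgrams));
    }
}
```

In this sketch the first check hits because both sides contain the token 00, while the second does not because no prefix of the serial equals 0001. A value such as 0001abcd would still match the un-expanded query, which is the prefix-search behavior the edge n-gram index is meant to provide.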