Improve Lucene.Net analyzer

I'm using Lucene.Net and the Snowball analyzer in an ASP.NET application.

With the specific language I'm using, I have the following issue: two particular words with different meanings stem to the same result, so a search for either of them returns results for both.

How can I teach the analyzer either not to stem these two words or, while still stemming them, to know that they have different meanings?

There are 2 answers

0
femtoRgon

With Lucene 4.0, EnglishAnalyzer now has this ability, since it has a constructor that takes a stemExclusionSet.

Of course, Lucene.Net isn't up to Lucene 4 yet, so fat lot of good that does.

However, EnglishAnalyzer does this by using a KeywordMarkerFilter. So you can create your own Analyzer, overriding the tokenStream method, and adding into the chain a KeywordMarkerFilter just before the SnowballFilter.

Something like:

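// Sketch only: stopSet, stemExclusionSet and name are fields of your custom Analyzer.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopSet != null)
        result = new StopFilter(result, stopSet);
    // Mark excluded terms as keywords so the stemmer leaves them untouched.
    result = new KeywordMarkerFilter(result, stemExclusionSet);
    // name is the Snowball stemmer name for your language.
    result = new SnowballFilter(result, name);
    return result;
}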

You'll need to construct your own stemExclusionSet (see CharArraySet).
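
To show the idea without tying it to any particular Lucene version, here is a minimal plain-Java sketch of "check the exclusion set first, stem only if the term isn't in it". Everything in it is hypothetical and not Lucene API: StemExclusionSketch, toyStem and analyze are made-up names, and toyStem is a deliberately crude stand-in for a Snowball stemmer (it just strips a trailing "s"). Note that the lookup happens after lowercasing, so the set entries are assumed to be lowercase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
class StemExclusionSketch {

    // Toy stand-in for a real stemmer: strips one trailing "s"
    // from terms longer than three characters.
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Lowercases each whitespace-separated term, then stems it
    // unless it appears in the exclusion set.
    static List<String> analyze(String text, Set<String> stemExclusionSet) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = raw.toLowerCase(Locale.ROOT);
            boolean isKeyword = stemExclusionSet.contains(term);
            out.add(isKeyword ? term : toyStem(term));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> none = new HashSet<>();
        Set<String> protectedTerms = new HashSet<>(Arrays.asList("news"));
        // Without protection the two words collide after stemming.
        System.out.println(analyze("new news", none));           // [new, new]
        // With "news" excluded from stemming they stay distinct.
        System.out.println(analyze("new news", protectedTerms)); // [new, news]
    }
}
```

The same exclusion set has to be used at both index time and query time, otherwise the protected word is indexed one way and searched another.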

1
Lord Darth Vader

I am working from memory here, but as I recall, one of the constructors lets you pass an array of stop words, which will stop the passed-in words from being stemmed.