Is there an easy and standard way to customize the Lucene Snowball stemmer?

I'm using Lucene 7.x and ItalianStemmer. I have looked at the code of the ItalianStemmer class, and it would take me a long time to understand. So I'm looking for a quick (possibly standard) way to customize the Italian stemmer without extending ItalianStemmer or SnowballProgram, because I only have a few days.

The point is that I don't understand why the noun "saluto" (greeting) is stemmed to "sal". It should be stemmed to "salut", since the verb "salutare" (to greet) is stemmed to "salut". Moreover, "sala" (room) and "sale" (rooms) are also stemmed to "sal", which is confusing, because they have a different meaning from "saluto".

Best answer, by femtoRgon:

The standard way would be to copy the stemmer's source and create your own version.

Stemming is a heuristic, rule-based process. It is designed to generate stems that, while imperfect, are usually good enough to facilitate search. It doesn't have a dictionary of conjugated words and their stems for you to modify. "-uto" is one of the verb suffixes that the Italian Snowball stemmer removes from words, as described in the Snowball description of the Italian stemming algorithm, which is why "saluto" becomes "sal". You could create your own version that drops that suffix from the list, but all told you would probably create more problems than you solve.
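To make the point about rules versus dictionaries concrete, here is a toy sketch in plain Java. It is not the real Italian Snowball algorithm: the class name ToySuffixStemmer, the five-entry suffix list, and the minimum stem length are all assumptions made up for illustration. It only shows how blind suffix stripping maps "saluto", "sala" and "sale" to the same stem.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: NOT the real Italian Snowball algorithm.
final class ToySuffixStemmer {
    // Hypothetical, heavily reduced suffix list. Suffixes are tried in order,
    // so the longer ones come first.
    private static final List<String> SUFFIXES = Arrays.asList("are", "uto", "a", "e", "o");

    // Assumed lower bound on what must remain after stripping a suffix.
    private static final int MIN_STEM_LENGTH = 3;

    // Strips the first matching suffix; there is no dictionary lookup at all.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String word : Arrays.asList("saluto", "salutare", "sala", "sale")) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Removing "uto" from the list would make "saluto" fall through to the "-o" rule and come out as "salut", but every other word ending in "-uto" would change too, which is the kind of side effect warned about above.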

A tool that returns the correct root word would generally be called a lemmatizer, and I don't believe any come with Lucene out of the box. Morphological analysis tends to be slower and more complex than stemming. If it's important to your use case, you might want to look up an Italian lemmatizer and either work it into a custom filter or preprocess your text before passing it to the analyzer.
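The preprocessing route could look roughly like the sketch below, again in plain Java with no Lucene dependency. Everything in it is hypothetical: the class LemmaPreprocessor, the whitespace tokenization, and the tiny hand-written lemma map stand in for whatever a real Italian lemmatizer would provide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical preprocessing step: replace known word forms with their lemma
// before the text reaches the analyzer. Unknown words pass through unchanged.
final class LemmaPreprocessor {
    private final Map<String, String> lemmas;

    LemmaPreprocessor(Map<String, String> lemmas) {
        this.lemmas = new HashMap<>(lemmas); // defensive copy
    }

    // Lowercases, splits on whitespace, and looks each token up in the map.
    String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace yields an empty first token
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(lemmas.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> lemmas = new HashMap<>();
        lemmas.put("sale", "sala");     // made-up entry: "rooms" -> "room"
        lemmas.put("saluti", "saluto"); // made-up entry: "greetings" -> "greeting"
        LemmaPreprocessor preprocessor = new LemmaPreprocessor(lemmas);
        System.out.println(preprocessor.normalize("Le sale grandi"));
    }
}
```

If the analyzer still runs the Snowball stemmer afterwards, the lemmas will be stemmed again, so this only helps when the stemmer is dropped from the chain or the lemmatized tokens are protected from it. Inside Lucene, StemmerOverrideFilter appears to cover a similar dictionary-first idea for a small list of exceptions: placed before the stemmer, it rewrites listed terms and marks them as keywords so that later stemming filters skip them. Check the 7.x documentation before relying on that.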