Custom Language Stemmer for Elasticsearch


Is there a way to create a new stemmer? For example, there is a built-in analyzer for the Czech language with a Czech stemmer. The algorithm was developed by some people in the Netherlands. It's not bad, but to a native speaker it is clear that its authors do not speak the language. If I wanted to create my own stemming algorithm, how could I do it in Elasticsearch?

Thanks.

1 answer
bpgergo

Elasticsearch is based on Lucene, so this answer is about how to add a custom stemmer to Lucene.

This is how I implemented a Lucene Analyzer based on a custom stemmer (or lemmatizer, to be more precise):

https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/StemmerAnalyzer.java

See also these two classes: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/analysis/CompoundStemmerTokenFilter.java

https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/LemmatizerWrapper.java

Note that this is for an older version of Lucene, 3.2/3.3. The same implementation would probably be simpler in newer versions. https://code.google.com/p/hunglish-webapp/source/browse/trunk/pom.xml
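To give a rough idea of the stemming logic itself, independent of any Lucene version, here is a minimal plain-Java sketch. It is not the code from the linked project. The class name, the suffix list, and the minimum stem length are all hypothetical and are not meant as linguistically correct rules for Czech or any other language. In Lucene, a method like stem() would typically be called from a TokenFilter subclass inside incrementToken(), rewriting the term held in the CharTermAttribute, and that filter would then be wired into an Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SuffixStemmerSketch {
    // Hypothetical suffix list, checked in order (longer suffixes first).
    // These are illustrative only, NOT real morphological rules.
    private static final String[] SUFFIXES = {"ami", "em", "y", "u", "a"};

    // Assumption: never strip a suffix if fewer than 3 characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    // Strip the first matching suffix; return the term unchanged otherwise.
    static String stem(String term) {
        for (String suffix : SUFFIXES) {
            int stemLength = term.length() - suffix.length();
            if (term.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return term.substring(0, stemLength);
            }
        }
        return term;
    }

    // Stand-in for an analyzer chain: lowercase, split on whitespace, stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hrady hradu hradem"));
    }
}
```

Keeping the stemming rules in a plain method like this makes them easy to unit test before dealing with the Lucene or Elasticsearch plumbing.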