How do we filter all tokens belonging to a certain language using SOLR?

In my case, I want to filter out all English words from documents that predominantly contain Arabic words.

There is 1 answer

Alexandre Rafalovitch On

Assuming the text is in Unicode, English and Arabic letters belong to different scripts (Latin and Arabic) and occupy different character ranges, so you can filter one of them out with regular expressions.

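So, in Solr, you would use something like PatternReplaceFilterFactory with standard Java regular expressions. Note that Java's regular expression implementation is quite rich: it supports Unicode scripts (for example `\p{IsLatin}`), blocks (for example `\p{InArabic}`) and other shorthand for standard Unicode ranges, so you do not have to spell out the code point ranges by hand.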
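To make the regular expression part concrete, here is a minimal plain-Java sketch of the idea, outside of Solr. The class and method names (`ScriptFilterSketch`, `dropLatinTokens`) are made up for illustration, and the whitespace split is only a stand-in for whatever tokenizer your field type actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: shows the regex behaviour only.
final class ScriptFilterSketch {
    // \p{IsLatin} matches any code point whose Unicode script is Latin.
    private static final Pattern LATIN_RUN = Pattern.compile("\\p{IsLatin}+");

    // Splits on whitespace (an assumption standing in for a real tokenizer),
    // removes Latin-script runs from each token, and drops tokens left empty.
    static List<String> dropLatinTokens(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String cleaned = LATIN_RUN.matcher(token).replaceAll("");
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two Arabic words interleaved with two English words.
        String mixed = "\u0645\u0631\u062d\u0628\u0627 hello "
                     + "\u0628\u0627\u0644\u0639\u0627\u0644\u0645 world";
        System.out.println(dropLatinTokens(mixed));
    }
}
```

In a Solr field type, the same pattern would go into the `pattern` attribute of PatternReplaceFilterFactory, with `replacement=""` and `replace="all"`. Since that filter rewrites token text rather than deleting tokens, you will probably also want a length-based filter (such as LengthFilterFactory with a minimum of 1) after it to discard tokens that end up empty; treat that as an assumption to check against your Solr version.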

Solr also has some ICU filters and tokenizers, but they are more for transliteration, transformation and normalization of complex characters.