How do we filter all tokens belonging to a certain language using SOLR?

Question

72 views Asked by Swetha Baskaran At 22 June 2015 at 18:50

In my case, I want to filter out all English words from documents that predominantly contain Arabic words.

There is 1 answer

**Alexandre Rafalovitch** · Answer 1 · 2015-06-25T14:05:41+00:00

Assuming the text is in Unicode, English (Latin) and Arabic letters occupy different character ranges, so you can filter one or the other out with regular expressions.
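To illustrate the point about separate character ranges, here is a minimal sketch using the JDK's `Character.UnicodeScript` lookup. The class name `ScriptCheck` and the helper `scriptOf` are made up for this example and are not part of Solr.

```java
// Illustrative only: ScriptCheck and scriptOf are hypothetical names.
class ScriptCheck {
    // Returns the Unicode script of the first code point in the string.
    static Character.UnicodeScript scriptOf(String s) {
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("a"));  // LATIN
        System.out.println(scriptOf("ع"));  // ARABIC
    }
}
```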

So, in Solr, you could use something like PatternReplaceFilterFactory with standard Java regular expressions. Note that Java's regex implementation is quite rich: it supports Unicode scripts, blocks and other shorthand classes for standard Unicode ranges (for example, \p{IsLatin} matches any character in the Latin script).
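As a rough sketch of what such a pattern does, the plain-Java example below strips Latin-script characters from each token and drops tokens that end up empty. It only imitates the effect outside Solr; the class `LatinTokenFilter` and its methods are hypothetical names, and the sample tokens are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: LatinTokenFilter is a hypothetical helper, not a Solr class.
class LatinTokenFilter {
    // \p{IsLatin} matches any code point whose Unicode script is Latin.
    private static final Pattern LATIN = Pattern.compile("\\p{IsLatin}+");

    // Removes every run of Latin-script characters from a single token.
    static String stripLatin(String token) {
        return LATIN.matcher(token).replaceAll("");
    }

    // Applies stripLatin to each token and keeps only the non-empty results.
    static List<String> filterTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String stripped = stripLatin(token);
            if (!stripped.isEmpty()) {
                kept.add(stripped);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("مرحبا", "hello", "العالم", "Solr");
        System.out.println(filterTokens(tokens)); // [مرحبا, العالم]
    }
}
```

In a Solr analyzer chain, the same expression would presumably go into the `pattern` attribute of PatternReplaceFilterFactory with an empty `replacement`. Tokens that become empty may then need a separate filter (a length filter, for instance) to remove them; that part is an assumption and worth verifying on the Analysis screen.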

Solr also has some ICU filters and tokenizers, but they are more for transliteration, transformation and normalization of complex characters.
