How does the full-text search Snowball stemmer handle words in a language other than the one specified?


I am building a full-text search index with SQLite and don't understand what happens internally when I index documents that contain more than one language.

For example, I describe a programming topic I'm learning in Russian and add code blocks to the description. The code blocks contain programming-language statements and comments, which are obviously in English.

Consider the example document.txt:

Вывод хранимых данных производится следующей командой

import storage
def main():  # Comments just to represent an example
    print(storage.data)

As you can see, document.txt contains two languages.

I use the snowball tokenizer (it reuses the standard Snowball library) to index the completed documents, explicitly specifying CREATE VIRTUAL TABLE documents USING fts5(text, tokenize='snowball russian');, and it handles them with no issues. So my question is: why? The documents contain English words, and afterwards the index contains English stems along with Russian stems. I can search for команда or commenting successfully. Is this how it is supposed to work?
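
One way to check what actually ends up in the index is to query an fts5vocab table. The sketch below is only an illustration, not the setup described above. It assumes Python's sqlite3 module is linked against an SQLite build with FTS5 enabled, and it swaps in SQLite's built-in porter tokenizer (an English-only stemmer) because the snowball tokenizer is a separate extension that would have to be loaded first. The table name documents_vocab and the helper functions build_index, indexed_terms and count_matches are hypothetical names chosen for this example.

```python
import sqlite3

# The sample document from the question (with the colon after main()).
DOCUMENT = """Вывод хранимых данных производится следующей командой

import storage
def main():  # Comments just to represent an example
    print(storage.data)
"""


def build_index(text):
    """Create an in-memory FTS5 index holding a single document."""
    conn = sqlite3.connect(":memory:")
    # ASSUMPTION: built-in 'porter' (English) stands in for 'snowball russian'.
    conn.execute(
        "CREATE VIRTUAL TABLE documents "
        "USING fts5(text, tokenize='porter unicode61')"
    )
    # fts5vocab exposes the terms stored in the index, one row per term.
    conn.execute(
        "CREATE VIRTUAL TABLE documents_vocab "
        "USING fts5vocab(documents, 'row')"
    )
    conn.execute("INSERT INTO documents(text) VALUES (?)", (text,))
    return conn


def indexed_terms(conn):
    """Return every term in the index, after tokenizing and stemming."""
    rows = conn.execute("SELECT term FROM documents_vocab ORDER BY term")
    return [row[0] for row in rows]


def count_matches(conn, query):
    """Count documents matching a single-word full-text query."""
    row = conn.execute(
        "SELECT count(*) FROM documents WHERE documents MATCH ?", (query,)
    ).fetchone()
    return row[0]


conn = build_index(DOCUMENT)
print(indexed_terms(conn))
for query in ("commenting", "данных", "команда"):
    print(query, count_matches(conn, query))
```

With the English stemmer standing in, commenting should match because both it and Comments reduce to the same stem, and an exact Russian word form such as данных should match because it is indexed as-is. The form команда should not match командой here, since nothing stems the Russian words. Running the same inspection against a table created with tokenize='snowball russian' would show which stems that tokenizer really stores for each language.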
