I am building a full-text search index with SQLite, and I don't understand what happens internally when the documents I index contain several languages.
For example, I describe a programming topic I'm learning in Russian and include code blocks in the description, with statements and comments that are obviously in English.
Consider this example, document.txt:
Вывод хранимых данных производится следующей командой
import storage
def main():  # Comments just to represent an example
    print(storage.data)
As you can see, document.txt contains two languages.
I use the snowball tokenizer (it reuses the standard Snowball library) to index the documents, explicitly specifying the language: CREATE VIRTUAL TABLE documents USING fts5(text, tokenize='snowball russian');
It handles this with no issues, and my question is why. The documents contain English words, and afterwards the index contains English stems alongside the Russian ones: I can search for команда
or commenting
successfully. Is this how it is supposed to work?
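
To see which terms actually end up in the index, I think the fts5vocab virtual table can be used. Below is a minimal sketch in Python. Note the assumptions: the snowball tokenizer is a loadable extension, so the sketch falls back to the built-in unicode61 and porter tokenizers as stand-ins, and index_terms is just a hypothetical helper name I made up for illustration. With the extension loaded, the tokenize argument could presumably be replaced by 'snowball russian'.

```python
import sqlite3


def index_terms(text, tokenize="unicode61"):
    """Index one document and return the terms FTS5 stored for it.

    `tokenize` is passed to the FTS5 `tokenize` option. The default is the
    built-in unicode61 tokenizer (an assumption made so the sketch runs
    without loading the snowball extension).
    """
    con = sqlite3.connect(":memory:")
    # FTS5 tables must be created as virtual tables.
    con.execute(
        f"CREATE VIRTUAL TABLE documents USING fts5(text, tokenize='{tokenize}')"
    )
    # fts5vocab exposes the contents of the full-text index, one row per term.
    con.execute(
        "CREATE VIRTUAL TABLE documents_vocab USING fts5vocab(documents, 'row')"
    )
    con.execute("INSERT INTO documents(text) VALUES (?)", (text,))
    terms = [
        row[0]
        for row in con.execute("SELECT term FROM documents_vocab ORDER BY term")
    ]
    con.close()
    return terms


if __name__ == "__main__":
    doc = (
        "Вывод хранимых данных производится следующей командой\n"
        "import storage\n"
        "def main():  # Comments just to represent an example\n"
        "    print(storage.data)\n"
    )
    # Print every term the tokenizer produced for the mixed-language document.
    for term in index_terms(doc, "porter unicode61"):
        print(term)
```

Running the same query against the real documents table should show whether the English words are stored stemmed or unchanged.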