SQLite FTS5 SELECT doesn't MATCH Ukrainian 'Усмiшки' with search string 'усмi*', but with 'усм*' it does

281 views Asked by At

I created FTS5 virtual table in SQLite 3 and inserted many titles in it
CREATE VIRTUAL TABLE books_fts USING fts5(title, content='');

When I try to do Full Text Search with
SELECT rowid FROM books_fts WHERE title MATCH ?;
with parameter values 'усмi*' or 'ночiв*' I get no rows.

But with 'усм*' or 'ноч*' I get rowids for right corresponding tokens 'Усмiшки' or 'Ночiвля'.

Is it FTS5 bug in SQLite or I do something wrong?

Thanks in advance for the replies

Update:

sqlite3 fts.db
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> DROP TABLE IF EXISTS books_fts;
sqlite> CREATE VIRTUAL TABLE books_fts USING fts5(title, content='');
sqlite> INSERT INTO books_fts (rowid, title) VALUES (1, 'Усмiшки');
sqlite> INSERT INTO books_fts (rowid, title) VALUES (2, 'Ночiвля');
sqlite> SELECT rowid FROM books_fts WHERE title MATCH 'усм*';
1
sqlite> SELECT rowid FROM books_fts WHERE title MATCH 'усмi*';
1
sqlite> SELECT rowid FROM books_fts WHERE title MATCH 'ноч*';
2
sqlite> SELECT rowid FROM books_fts WHERE title MATCH 'ночiв*';
2
sqlite> .q

I think something wrong with encoding in my code. I'll try to find out

Update 2: As I managed to find out today, the problem is more complicated than I supposed.

The case is that in database UTF-8 letter "i" is Cyrillic 0xd196 / 0xd1 0x96 (2 bytes) but in match pattern "i" is Latin 0x69 (1 byte)

I have tried to tune tokenizer

CREATE VIRTUAL TABLE books_fts USING fts5(title, content='', tokenize = 'unicode61 remove_diacritics 2');

but it didn't help because it seems it works only for Latin diacritics :(

Has anyone encountered a similar problem and found the solution?

1

There are 1 answers

0
Serguei Vine On

From all the above, I conclude that the program code and the SQLite3 work correctly as they should. And to perform a full text search the user need a keyboard with the desired native language layout on the phone for the correct input of national letters.

Thanks again everyone.