Sqlite FTS5 punctuation marks not working in select query

2.3k views Asked by At

I'm doing full text search using sqlite and below are some select query examples that I'm using.

Ex:

  1. SELECT * FROM table WHERE table MATCH 'column:father's' ORDER BY rank;

  2. SELECT * FROM table WHERE table MATCH 'column:example:' ORDER BY rank;

  3. SELECT * FROM table WHERE table MATCH 'column:month&' ORDER BY rank;

Since i'm using ' : & characters within the search text, these queries are giving me errors. I have tried using escape characters(\-backslash) before the puncuation marks as well.

Any solution to search the punctation marks(, . / " ' - & etc.) in fts5 with the MATCH operator?

These characters are working _, €, £, ¥ with the match operator

Thanks

3

There are 3 answers

0
user5235812 On BEST ANSWER

This seems to be a duplicate of this question. Try the top answer there that states you should enclose your search string within both single and double quotes.

0
Jaymon On

I'd like to walk through a full example because I've found it easy to get subtle and unexpected results with fts5.

First, while wrapping the search string will possibly give you the right answer, it might not be what you actually wanted, here is an example to illustrate:

$ sqlite3 ":memory:"
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS bad USING fts5(term, tokenize="unicode61");
sqlite>
sqlite> INSERT INTO bad (term) VALUES ('father''s');
sqlite>
sqlite> SELECT * from bad WHERE term MATCH 'father';
father's
sqlite> SELECT * from bad WHERE term MATCH '"father''s"';
father's
sqlite> SELECT * from bad WHERE term MATCH 's';
father's

Notice how the s matches father's also? That's because when you run father's through the tokenizer it will get tokenized according to the following rules by default:

An FTS5 bareword is a string of one or more consecutive characters that are all either:

  • Non-ASCII range characters (i.e. unicode codepoints greater than 127), or
  • One of the 52 upper and lower case ASCII characters, or
  • One of the 10 decimal digit ASCII characters, or
  • The underscore character (unicode codepoint 96).
  • The substitute character (unicode codepoint 26).

So father's would get tokenized to father and s, which may or may not be what you wanted, but for the sake of this answer I'm going to assume that's not what you wanted.

So how would you tell the tokenizer to keep father's together? By using the tokenchars option of the tokenize param:

tokenchars This option is used to specify additional unicode characters that should be considered token characters, even if they are white-space or punctuation characters according to Unicode 6.1. All characters in the string that this option is set to are considered token characters.

Let's look at another example, this time one that uses tokenchars:

$ sqlite3 ":memory:"
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS good USING fts5(term, tokenize="unicode61  tokenchars '''&:'");
sqlite>
sqlite> INSERT INTO good (term) VALUES ('father''s');
sqlite> INSERT INTO good (term) VALUES ('month&');
sqlite> INSERT INTO good (term) VALUES ('example:');
sqlite>
sqlite> SELECT count(*) from good WHERE term MATCH 'father';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"father''s"';
1
sqlite> SELECT count(*) from good WHERE term MATCH 'example';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"example:"';
1
sqlite> SELECT count(*) from good WHERE term MATCH 'month';
0
sqlite> SELECT count(*) from good WHERE term MATCH '"month&"';
1

Those results seem more expected. But what about our random s result from the first example?

sqlite> SELECT count(*) from good WHERE term MATCH 's';
0

Great!

Hopefully this helps you set up the table the way you intended.

0
The Demz On
# fathers'
SELECT * FROM table WHERE table MATCH 'column:"father''s"';

# example:
SELECT * FROM table WHERE table MATCH 'column:"example:"';

# month&
SELECT * FROM table WHERE table MATCH 'column:"month&"';