SQLITE custom Accent collation function and LIKE queries

4.9k views Asked by At

I've created this NOACCENT COLLATION in Python3 to use it in an sqlite database program I'm trying to write.

My aim is to make accented vocals insensitive searches and to make data input insenstive to accents too so that (i.e.) the value 'González' will not be admitted if there is a 'Gonzalez' value already in that table field.

What I did works well for the second part (restrict values) and for complete value searches. I cannot make it to work for SELECT ... LIKE expressions. I know about the LIKE limitation of COLLATE NOCASE in SQLite for unicode characters out of a-z, A-Z but why that limitation applies when I'm using my own defined collation NOACCENT ? 'Cause it seems that's what happening.

Example:

cur.execute(""" CREATE TABLE Demo(
            Id INTEGER PRIMARY KEY,
            Nombre TEXT COLLATE NOACCENT UNIQUE)""")

cur.execute("INSERT INTO Demo(Nombre) VALUES ('álberto')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('alberta')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Álbertu')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Álberte')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Albertczo')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('albertai')")

If I do:

cur.execute("SELECT * FROM Demo WHERE Nombre='ALBERTO' COLLATE NOACCENT")

I get:

(1, 'álberto)

What is fine. This works with any combination of upper and lower case and any accent in any vocal of the word ('ALBERTÓ', 'Älbèrtô'...)

If i try to insert the value 'ALBERTÓ' i get an error due to the unique constraint so this is working well.

The problem is when i try to use LIKE expressions:

cur.execute("SELECT * FROM Demo WHERE Nombre LIKE 'ALBERT%' COLLATE NOACCENT")

That returns:

(2, 'alberta')
(5, 'Albertczo')
(6, 'albertai')

The values without accent.

Is there any solution for this from the database itself or will I have to make partial searches (i.e. 'albert') through software.

Hope my question is clear enough.

Thanks in advance.


def noaccent(string): # eliminate accented vocals
    vocals = ['a', 'e', 'i', 'o', 'u'] * 4 
    acc_vocals = ['á', 'é', 'í', 'ó', 'ú','à', 'è', 'ì', 'ò', 'ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'â', 'ê', 'î', 'ô', 'û']
    no_acc = string
    for letter in acc_vocals:
        no_acc = no_acc.replace(letter, vocals[acc_vocals.index(letter)])
    return no_acc

def collate_noaccent(string1, string2):
    str1 = noaccent(string1.lower())
    str2 = noaccent(string2.lower())
    if str1 < str2:
        return -1
    elif str1 > str2:
        return 1
    else:
        return 0
2

There are 2 answers

1
CL. On

LIKE ignores collations, and always uses ASCII rules.

If you want to use LIKE pattern matching with other collations, you have to store a normalized form of the text in another column.

1
Dysmas On

I solved this problem by creating my own "like" function in the SqliteUnicode class I have designed. It slows down searchs by a factor of 8. Since the class is not yet optimized, I prefer not to post it here now, unless you ask.

    link = sqlite.connect(db_file_s)
    extension = SqliteUnicode()
    link.create_collation("france", extension.collate)
    link.create_function("like", 2, extension.like)

Edit 2019 : you can find the class here : https://pastebin.com/FvMZdBY5 Please, note I am not a professional, and there are still some cases where the personalised "like" may cause errors. The regular expressions have to be improved, but for our usage in a library database, it is working fine for years now. If you improve this work, please, share.