I've created this NOACCENT COLLATION in Python3 to use it in an sqlite database program I'm trying to write.
My aim is to make accented vocals insensitive searches and to make data input insenstive to accents too so that (i.e.) the value 'González' will not be admitted if there is a 'Gonzalez' value already in that table field.
What I did works well for the second part (restrict values) and for complete value searches. I cannot make it to work for SELECT ... LIKE expressions. I know about the LIKE limitation of COLLATE NOCASE in SQLite for unicode characters out of a-z, A-Z but why that limitation applies when I'm using my own defined collation NOACCENT ? 'Cause it seems that's what happening.
Example:
cur.execute(""" CREATE TABLE Demo(
Id INTEGER PRIMARY KEY,
Nombre TEXT COLLATE NOACCENT UNIQUE)""")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('álberto')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('alberta')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Álbertu')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Álberte')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('Albertczo')")
cur.execute("INSERT INTO Demo(Nombre) VALUES ('albertai')")
If I do:
cur.execute("SELECT * FROM Demo WHERE Nombre='ALBERTO' COLLATE NOACCENT")
I get:
(1, 'álberto)
What is fine. This works with any combination of upper and lower case and any accent in any vocal of the word ('ALBERTÓ', 'Älbèrtô'...)
If i try to insert the value 'ALBERTÓ' i get an error due to the unique constraint so this is working well.
The problem is when i try to use LIKE expressions:
cur.execute("SELECT * FROM Demo WHERE Nombre LIKE 'ALBERT%' COLLATE NOACCENT")
That returns:
(2, 'alberta')
(5, 'Albertczo')
(6, 'albertai')
The values without accent.
Is there any solution for this from the database itself or will I have to make partial searches (i.e. 'albert') through software.
Hope my question is clear enough.
Thanks in advance.
def noaccent(string): # eliminate accented vocals
vocals = ['a', 'e', 'i', 'o', 'u'] * 4
acc_vocals = ['á', 'é', 'í', 'ó', 'ú','à', 'è', 'ì', 'ò', 'ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'â', 'ê', 'î', 'ô', 'û']
no_acc = string
for letter in acc_vocals:
no_acc = no_acc.replace(letter, vocals[acc_vocals.index(letter)])
return no_acc
def collate_noaccent(string1, string2):
str1 = noaccent(string1.lower())
str2 = noaccent(string2.lower())
if str1 < str2:
return -1
elif str1 > str2:
return 1
else:
return 0
LIKE ignores collations, and always uses ASCII rules.
If you want to use LIKE pattern matching with other collations, you have to store a normalized form of the text in another column.