In my application, I have a users table, with first_name
and last_name
. I currently have a third column full_name
(automatically generated) like this : first_name + last_name + first_name (without special chars).
"Etienne", "De Crécy", "Etienne De Crecy Etienne"
At now, I have a simple algorithm to autocomplete user inputs (special chars removed):
SELECT * FROM users WHERE full_name LIKE "%input%"
This query returns Etienne with inputs Crécy Etienne
, Etienne De
, Cré
, Cre
, Etienne
I want to add some fuzzy in this query to allow users mispellings. This new algorithm should be able to return Etienne when users write:
Etiene
(similar to first name)Etienne Crecy
(similar to full name, without particule)Crecy Etienne
(similar to full name, without particule, other direction)De Cressi
(sounds like last name)Cressi
(sounds like last name, without particule)
I do a lot of searchs, the most relevant idea is to use SOUNDEX
method (or Metaphone
procedures), or levenstein
procedures. I can not use it as it, because:
- Soundex is based on the first letter, then
SOUNDEX(Cressy)
is not the same asSOUNDEX(De cressy)
, even if they are very similar. - Metaphone is base on the position of the letters (beggining by 'kn' is like begging by 'n', but only in first position)
- levenstein doesn't take care about string length : De Cressy is not similar to Cressy.
Have you any ideas about to "mix" theses methods, or do you have any other idea for me ?
I'd highly recommend trying out Solr or Elasticsearch for what you're asking (along with greater flexibility and better performance).
However, if you want to replicate a basic phonetic search engine inside MySQL you'll need to ability to extract multiple tokens (words or word encodings) per
full_name
while inserting (and per query when auto completing).1). First, make sure your
full_name
column is typeFULLTEXT
. Then switch toMATCH...AGAINST
query syntax in place ofLIKE %foo%
.This will buy you exact inner token matching like "cressy" for "de cressy".
Your idea of using Levenshtein distance as an ordering criterion isn't a bad one, but it's expensive to run, so make sure you've
LIMIT
ed yourMATCH...AGAINST
query andORDER BY MATCH... DESC
if you're going to output levenshtein(query, full_name) as a part of the select.2). If you're still interested in expanding your results to include sound-alikes:
Create a phonetic_token table with a foreign-key column back to your primary name table (it's a one-to-many relationship name-to-tokens).
Add columns
soundex
, andmetaphone
to this table.When inserting records into primary name table, additionally parse them on whitespace, and insert the soundex & metaphone encodings of each name-word into
phonetic_token
.You may want to add some parsing logic to ensure that all first name tripled and last name tripled are recorded (e.g. Before phonetic encoding, make sure that "de cressy" tokenizes to "de", "cressy" and "decressy" in order to match expected input.)
Now when querying for soundalike name-completions to display, you're actually going to be joining your primary name table with
phonetic_token
WHERE soundex IN (list of soundex codes from query tokens) OR metaphone IN (list of metaphone codes from query tokens).I'd suggest running this phonetic match as a second query in the cases where
MATCH(full_name)...AGAINST(query_text)
produces too few results.Again, Solr or Elasticsearch will do all this text-wrangling for you via configuration, while giving you snappier performance. Depending on the scope of your application, pulling text-matching out of MySQL may save you a lot of time and hassle.