Fuzzy autocomplete

1.1k views Asked by At

In my application, I have a users table, with first_name and last_name. I currently have a third column full_name (automatically generated) like this : first_name + last_name + first_name (without special chars).

"Etienne", "De Crécy", "Etienne De Crecy Etienne"

At now, I have a simple algorithm to autocomplete user inputs (special chars removed):

SELECT * FROM users WHERE full_name LIKE "%input%"

This query returns Etienne with inputs Crécy Etienne, Etienne De, Cré, Cre, Etienne

I want to add some fuzzy in this query to allow users mispellings. This new algorithm should be able to return Etienne when users write:

  • Etiene (similar to first name)
  • Etienne Crecy (similar to full name, without particule)
  • Crecy Etienne (similar to full name, without particule, other direction)
  • De Cressi (sounds like last name)
  • Cressi (sounds like last name, without particule)

I do a lot of searchs, the most relevant idea is to use SOUNDEX method (or Metaphone procedures), or levenstein procedures. I can not use it as it, because:

  • Soundex is based on the first letter, then SOUNDEX(Cressy) is not the same as SOUNDEX(De cressy), even if they are very similar.
  • Metaphone is base on the position of the letters (beggining by 'kn' is like begging by 'n', but only in first position)
  • levenstein doesn't take care about string length : De Cressy is not similar to Cressy.

Have you any ideas about to "mix" theses methods, or do you have any other idea for me ?

1

There are 1 answers

0
Peter Dixon-Moses On

I'd highly recommend trying out Solr or Elasticsearch for what you're asking (along with greater flexibility and better performance).

However, if you want to replicate a basic phonetic search engine inside MySQL you'll need to ability to extract multiple tokens (words or word encodings) per full_name while inserting (and per query when auto completing).

1). First, make sure your full_name column is type FULLTEXT. Then switch to MATCH...AGAINST query syntax in place of LIKE %foo%.

This will buy you exact inner token matching like "cressy" for "de cressy".

Your idea of using Levenshtein distance as an ordering criterion isn't a bad one, but it's expensive to run, so make sure you've LIMITed your MATCH...AGAINST query and ORDER BY MATCH... DESC if you're going to output levenshtein(query, full_name) as a part of the select.

Goal is to avoid running levenshtein on all rows

2). If you're still interested in expanding your results to include sound-alikes:

Create a phonetic_token table with a foreign-key column back to your primary name table (it's a one-to-many relationship name-to-tokens).

Add columns soundex, and metaphone to this table.

When inserting records into primary name table, additionally parse them on whitespace, and insert the soundex & metaphone encodings of each name-word into phonetic_token.

You may want to add some parsing logic to ensure that all first name tripled and last name tripled are recorded (e.g. Before phonetic encoding, make sure that "de cressy" tokenizes to "de", "cressy" and "decressy" in order to match expected input.)

Now when querying for soundalike name-completions to display, you're actually going to be joining your primary name table with phonetic_token WHERE soundex IN (list of soundex codes from query tokens) OR metaphone IN (list of metaphone codes from query tokens).

I'd suggest running this phonetic match as a second query in the cases where MATCH(full_name)...AGAINST(query_text) produces too few results.


Again, Solr or Elasticsearch will do all this text-wrangling for you via configuration, while giving you snappier performance. Depending on the scope of your application, pulling text-matching out of MySQL may save you a lot of time and hassle.