fuzzy merging two tables postgresql

2.7k views Asked by At

I need to join two tables based on names. And the problem is that names may be a slight mispelling in one of the database. I have remedy this problem in the past using Stata and Python's fuzzy merging, where names are matched based on how closely similar they are, but I am wondering if this is possible to do in Postgresql.

For example, may data may be something similar to this:

Table A:

first_name_a | last_name_a | id_a
----------------------------------
   William   |   Hartnell  | 1
   Matt      |   Smithe    | 2
   Paul      |   McGann    | 3
   David     |   Tennant   | 4
   Colin     |   Baker     | 5

Table B:

first_name_b | last_name_b | id_b
----------------------------------
   Matt      |   Smith     | a
   Peter     |   Davison   | b
   Dave      |   Tennant   | c
   Colin     |   Baker     | d
   Will      |   Hartnel   | e

And in the end, I hope my results would look something like:

first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b
----------------------------------------------------------------------
 William     |  Hartnell   |  1   |  Will        | Hartnel     | e
 Matt        |  Smithe     |  2   |  Matt        | Smith       | a
 Paul        |  McGann     |  3   |              |             | 
 David       |  Tennant    |  4   |  Dave        | Tennant     | c
 Colin       |  Baker      |  5   |  Colin       | Baker       | d
             |             |      |  Peter       | Davison     | b

My Sonic Screwdriver gives me some pseudo-code like this:

SELECT a.*, b.* FROM A a
     JOIN B b
     WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
     AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
2

There are 2 answers

0
gwaigh On

Done up as an SQLFiddle. Play with the thresholds/look at some of the other mapping functions mentioned in matching fuzzy strings.

1
Nick Weedon On

The DML you mention:

SELECT a.*, b.* FROM A a
 JOIN B b
 WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
 AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1

Looks correct, just bump up the 'fuzziness' (given 'IS LESS THAN 1' substitute 1 for the 'fuzzyness' level that you you require)

See http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html for reference info on LEVENSHTEIN.