I need to join two tables based on names. And the problem is that names may be a slight mispelling in one of the database. I have remedy this problem in the past using Stata and Python's fuzzy merging, where names are matched based on how closely similar they are, but I am wondering if this is possible to do in Postgresql.
For example, may data may be something similar to this:
Table A:
first_name_a | last_name_a | id_a ---------------------------------- William | Hartnell | 1 Matt | Smithe | 2 Paul | McGann | 3 David | Tennant | 4 Colin | Baker | 5
Table B:
first_name_b | last_name_b | id_b ---------------------------------- Matt | Smith | a Peter | Davison | b Dave | Tennant | c Colin | Baker | d Will | Hartnel | e
And in the end, I hope my results would look something like:
first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b ---------------------------------------------------------------------- William | Hartnell | 1 | Will | Hartnel | e Matt | Smithe | 2 | Matt | Smith | a Paul | McGann | 3 | | | David | Tennant | 4 | Dave | Tennant | c Colin | Baker | 5 | Colin | Baker | d | | | Peter | Davison | b
My Sonic Screwdriver gives me some pseudo-code like this:
SELECT a.*, b.* FROM A a JOIN B b WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1 AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
Done up as an SQLFiddle. Play with the thresholds/look at some of the other mapping functions mentioned in matching fuzzy strings.