MySQL: Remove JOIN for Matched Row if 2nd Round of Criteria Not Met

128 views Asked by At

CONDENSED VERSION

I'm trying to join a new list with my existing database with no unique identifier -- but I'm trying to figure out a way to do it in one query that's more specific than matching by first name/last name but less specific than by all the fields available (first name/middle name/last name/address/phone).

So my idea was to match solely on first/last name and then try to assign each possible matching field with points to see if anyone who matched had 'zero points' and thus have the first name/last name match stripped from them. Here's what I came up with:

SELECT *, 
@MidMatch := IF(LEFT(l.middle,1)=LEFT(d.middle,1),"TRUE","FALSE") MidMatch, 
@AddressMatch := IF(left(l.address,5)=left(d.address,5),"TRUE","FALSE") AddressMatch, 
@PhoneMatch := IF(right(l.phone,4)=right(d.phone,4),"TRUE","FALSE") PhoneMatch,  
@Points := IF(@MidMatch = "TRUE",4,0) + IF(@AddressMatch = "TRUE",3,0) + IF(@PhoneMatch = "TRUE",1,0) Points
    FROM list l
    LEFT JOIN database d on IF(@Points <> 0,(l.first = d.first AND l.last = d.last),(l.first = d.first AND l.last = d.last AND l.address = d.vaddress));

The query runs fine but it does still match people who's first/last names are identical even if their points are zero (and if their addresses don't match).

Is there a way to do what I'm looking for with this roundabout points system? I've found that it helps me a lot when trying to identify which duplicate to choose, so I'm trying to expand it to the initial match. Or should I do something different?


SPECIFIC VERSION

This is kind of a roundabout idea -- so if somebody has something more straight forward, I'd definitely be willing to bail on this completely and try something else. But basically I have a 93k person table (from a database) that I'm matching against a 92k person table (from a new list). I expect many of them to be the same but certainly not all -- and I'm trying to avoid creating duplicates. Unfortunately, there's no unique identifiers that can be matched, so I'm generally stuck with matching based on some variation of first name, middle name, last name, address, and/or phone number.

The schema for the two tables (list and database) are pretty identical with the fields you see above (first name, middle name, last name, address, phone) -- the only difference is that the database table also has an unique numerical ID that I would use to upload back into the database after this match. Unfortunately the list table has no such ID. Records with the ID would get matched and loaded in on top of the old record and any record without that ID would get loaded as a new record.

What I'm trying to avoid with this question is creating a bunch of different tables and queries that start with a really specific JOIN statement and then eventually get down to just first and last name -- since there's likely some folks who should match but have moved and/or gotten a new phone number since this last list.

I could write a very simple query as a JOIN and do it numerous times, each time taking out another qualifier:

SELECT * 
FROM list l
JOIN database d
ON d.first = l.first AND d.last = l.last AND d.middle = l.middle AND d.address = l.address AND d.phone = l.phone;

And I'd certainly feel confident that those people from the new list matched with the existing people in my database, but it'd only return a very small amount of people, then I'd have to go back and loosen the criteria (e.g. drop the middle name restriction, etc.) and continually create tables then merge them all back together at the end along with all the ones that didn't match at all, which I would assume would be the new people.

But is there a way to write the query solely using a first/last name match, then evaluating the other criteria and wiping the match from people who have zero 'points' (below)? Here's what I attempted to do assigning [arbitrary] points to each match:

SELECT *, 
@MidMatch := IF(LEFT(l.middle,1)=LEFT(d.middle,1),"TRUE","FALSE") MidMatch, 
@AddressMatch := IF(left(l.address,5)=left(d.address,5),"TRUE","FALSE") AddressMatch, 
@PhoneMatch := IF(right(l.phone,4)=right(d.phone,4),"TRUE","FALSE") PhoneMatch,  
@Points := IF(@MidMatch = "TRUE",4,0) + IF(@AddressMatch = "TRUE",3,0) + IF(@PhoneMatch = "TRUE",1,0) Points
    FROM list l
    LEFT JOIN database d on IF(@Points <> 0,(l.first = d.first AND l.last = d.last),(l.first = d.first AND l.last = d.last AND l.address = d.vaddress));

The LEFT and RIGHT formulas within the IF statements are just attempting to control for unstandardized data that gets sent. I also would've done something with a WHERE statement, but I still need the NULL values to return so I know who matched and who didn't. So I ended up attempting to use an IF statement in the LEFT JOIN to say that if the Points cell was equal to zero, that the JOIN statement would get really specific and what I thought would hopefully still return the row but it wouldn't be matched to the database even if their first and last name did.

The query doesn't produce any errors, though unfortunately I'm still getting people back who have zeros in their Points column but matched with the database because their first and last names matched (which is what I was hoping the IF/Points stuff would stop).

Is this potentially a way to avoid bad matches, or am I going down the wrong path? If this isn't the right way to go, is there any other way to write one query that will return a full LEFT JOIN along with NULLs that don't match but have it be more specific than just first/last name but less work than doing a million queries based on a new table each time?

Thanks and hopefully that made some sense!

1

There are 1 answers

7
Gordon Linoff On BEST ANSWER

Your first query:

SELECT *, 
       @MidMatch := IF(LEFT(l.middle,1)=LEFT(d.middle,1),"TRUE","FALSE") MidMatch, 
       @AddressMatch := IF(left(l.address,5)=left(d.address,5),"TRUE","FALSE") AddressMatch, 
       @PhoneMatch := IF(right(l.phone,4)=right(d.phone,4),"TRUE","FALSE") PhoneMatch,  
       @Points := IF(@MidMatch = "TRUE",4,0) + IF(@AddressMatch = "TRUE",3,0) + IF(@PhoneMatch = "TRUE",1,0) Points
    FROM list l LEFT JOIN
         database d
         on IF(@Points <> 0,(l.first = d.first AND l.last = d.last),(l.first = d.first AND l.last = d.last AND l.address = d.vaddress));

This is making a serious mistake with regards to variables. The simplest is the SELECT -- the SELECT does not guarantee the order of calculation of expressions, so they could calculated in any order. And the logic is wrong if @Points is calculated first. This problem is compounded by referring to variables in different clauses. The SQL statement is a logical statement describing the results set, not a programmatic statement of how the query is run.

Let me assume that you have a unique identifier for each row in the database (just to identify the row). Then you can get the match by using a correlated subquery:

select l.*,
       (select d.databaseid
        from database d
        where l.first = d.first and l.last = d.last
        order by (4 * (LEFT(l.middle, 1) = LEFT(d.middle, 1) ) +
                  3 * (left(l.address, 5) = left(d.address, 5)) +
                  1 * (right(l.phone, 4) = right(d.phone, 4))
                 )
        limit 1
       ) as did
from list l;

You can join back to the database table to get more information if you need it.

EDIT:

Your comment made it clear. You don't just want the first and last name but something else as well.

select l.*,
       (select d.databaseid
        from database d
        where l.first = d.first and l.last = d.last and
              (LEFT(l.middle, 1) = LEFT(d.middle, 1) or
               left(l.address, 5) = left(d.address, 5) or
               right(l.phone, 4) = right(d.phone, 4)
              )                     
        order by (4 * (LEFT(l.middle, 1) = LEFT(d.middle, 1) ) +
                  3 * (left(l.address, 5) = left(d.address, 5)) +
                  1 * (right(l.phone, 4) = right(d.phone, 4))
                 )
        limit 1
       ) as did
from list l;