CONDENSED VERSION
I'm trying to join a new list with my existing database with no unique identifier -- but I'm trying to figure out a way to do it in one query that's more specific than matching by first name/last name but less specific than by all the fields available (first name/middle name/last name/address/phone).
So my idea was to match solely on first/last name and then try to assign each possible matching field with points to see if anyone who matched had 'zero points' and thus have the first name/last name match stripped from them. Here's what I came up with:
SELECT *,
@MidMatch := IF(LEFT(l.middle,1)=LEFT(d.middle,1),"TRUE","FALSE") MidMatch,
@AddressMatch := IF(left(l.address,5)=left(d.address,5),"TRUE","FALSE") AddressMatch,
@PhoneMatch := IF(right(l.phone,4)=right(d.phone,4),"TRUE","FALSE") PhoneMatch,
@Points := IF(@MidMatch = "TRUE",4,0) + IF(@AddressMatch = "TRUE",3,0) + IF(@PhoneMatch = "TRUE",1,0) Points
FROM list l
LEFT JOIN database d on IF(@Points <> 0,(l.first = d.first AND l.last = d.last),(l.first = d.first AND l.last = d.last AND l.address = d.vaddress));
The query runs fine but it does still match people who's first/last names are identical even if their points are zero (and if their addresses don't match).
Is there a way to do what I'm looking for with this roundabout points system? I've found that it helps me a lot when trying to identify which duplicate to choose, so I'm trying to expand it to the initial match. Or should I do something different?
SPECIFIC VERSION
This is kind of a roundabout idea -- so if somebody has something more straight forward, I'd definitely be willing to bail on this completely and try something else. But basically I have a 93k person table (from a database) that I'm matching against a 92k person table (from a new list). I expect many of them to be the same but certainly not all -- and I'm trying to avoid creating duplicates. Unfortunately, there's no unique identifiers that can be matched, so I'm generally stuck with matching based on some variation of first name, middle name, last name, address, and/or phone number.
The schema for the two tables (list and database) are pretty identical with the fields you see above (first name, middle name, last name, address, phone) -- the only difference is that the database table also has an unique numerical ID that I would use to upload back into the database after this match. Unfortunately the list table has no such ID. Records with the ID would get matched and loaded in on top of the old record and any record without that ID would get loaded as a new record.
What I'm trying to avoid with this question is creating a bunch of different tables and queries that start with a really specific JOIN
statement and then eventually get down to just first and last name -- since there's likely some folks who should match but have moved and/or gotten a new phone number since this last list.
I could write a very simple query as a JOIN
and do it numerous times, each time taking out another qualifier:
SELECT *
FROM list l
JOIN database d
ON d.first = l.first AND d.last = l.last AND d.middle = l.middle AND d.address = l.address AND d.phone = l.phone;
And I'd certainly feel confident that those people from the new list matched with the existing people in my database, but it'd only return a very small amount of people, then I'd have to go back and loosen the criteria (e.g. drop the middle name restriction, etc.) and continually create tables then merge them all back together at the end along with all the ones that didn't match at all, which I would assume would be the new people.
But is there a way to write the query solely using a first/last name match, then evaluating the other criteria and wiping the match from people who have zero 'points' (below)? Here's what I attempted to do assigning [arbitrary] points to each match:
SELECT *,
@MidMatch := IF(LEFT(l.middle,1)=LEFT(d.middle,1),"TRUE","FALSE") MidMatch,
@AddressMatch := IF(left(l.address,5)=left(d.address,5),"TRUE","FALSE") AddressMatch,
@PhoneMatch := IF(right(l.phone,4)=right(d.phone,4),"TRUE","FALSE") PhoneMatch,
@Points := IF(@MidMatch = "TRUE",4,0) + IF(@AddressMatch = "TRUE",3,0) + IF(@PhoneMatch = "TRUE",1,0) Points
FROM list l
LEFT JOIN database d on IF(@Points <> 0,(l.first = d.first AND l.last = d.last),(l.first = d.first AND l.last = d.last AND l.address = d.vaddress));
The LEFT
and RIGHT
formulas within the IF
statements are just attempting to control for unstandardized data that gets sent. I also would've done something with a WHERE
statement, but I still need the NULL
values to return so I know who matched and who didn't. So I ended up attempting to use an IF
statement in the LEFT JOIN
to say that if the Points
cell was equal to zero, that the JOIN
statement would get really specific and what I thought would hopefully still return the row but it wouldn't be matched to the database even if their first and last name did.
The query doesn't produce any errors, though unfortunately I'm still getting people back who have zeros in their Points
column but matched with the database because their first and last names matched (which is what I was hoping the IF/Points stuff would stop).
Is this potentially a way to avoid bad matches, or am I going down the wrong path? If this isn't the right way to go, is there any other way to write one query that will return a full LEFT JOIN
along with NULL
s that don't match but have it be more specific than just first/last name but less work than doing a million queries based on a new table each time?
Thanks and hopefully that made some sense!
Your first query:
This is making a serious mistake with regards to variables. The simplest is the
SELECT
-- theSELECT
does not guarantee the order of calculation of expressions, so they could calculated in any order. And the logic is wrong if@Points
is calculated first. This problem is compounded by referring to variables in different clauses. The SQL statement is a logical statement describing the results set, not a programmatic statement of how the query is run.Let me assume that you have a unique identifier for each row in the database (just to identify the row). Then you can get the match by using a correlated subquery:
You can join back to the database table to get more information if you need it.
EDIT:
Your comment made it clear. You don't just want the first and last name but something else as well.