I've got customers and leads coming from different sources and I need to figure out if customer is already registered as a lead.
I use 12 fields for matching:
address1_clear
address2_clear
address_clear
contact_name_clear
email
invoice_mobile
invoice_phone
mobile
name_clear
phone
phone2
taxnum
(_clear
suffix means data is in lower case, w/o white space and punctuation).
- leads - 300k records
- customers - 500k records
- customers_leads - 460k records
This is the query used to perform matching:
SELECT l.id as lead_id, c.id as customer_id FROM lead l
INNER JOIN sync_settings s ON s.account_id = l.account_id
INNER JOIN customers c ON c.setting_id = s.id
LEFT JOIN customers_leads cl ON cl.customer_id = c.id AND cl.lead_id = l.id
WHERE cl.lead_id IS NULL AND
(
(l.phone IS NOT NULL AND l.phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
(l.mobile IS NOT NULL AND l.mobile != "" AND l.mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
(l.invoice_phone IS NOT NULL AND l.invoice_phone != "" AND l.invoice_phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
(l.invoice_mobile IS NOT NULL AND l.invoice_mobile != "" AND l.invoice_mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
(l.email IS NOT NULL AND l.email != "" AND l.email = c.email) OR
(l.taxnum IS NOT NULL AND l.taxnum != "" AND l.taxnum = c.taxnum) OR
(l.contact_name_clear IS NOT NULL AND l.contact_name_clear != "" AND l.contact_name_clear = c.contact_name_clear) OR
(l.address1_clear IS NOT NULL AND l.address1_clear != "" AND l.address1_clear = c.address_clear) OR
(l.address2_clear IS NOT NULL AND l.address2_clear != "" AND l.address2_clear = c.address_clear) OR
(l.name_clear IS NOT NULL AND l.name_clear != "" AND l.name_clear IN (c.contact_name_clear, c.name_clear))
)
It is super heavy, response time is ~4 minutes. Indexing didn't help much because of ORs and additional conditions.
I would like to know: is there some better way to do it? Maybe using some NoSQL database to build a huge hash table basically or some data matching technique that I wasn't able to google?
P. S. I'm aware that I can make separate tables purely for matching fields and it will be faster, but still I would like to know my alternatives.
The problem you have is called record linkage and there is no database solution that natively addresses the problem.
There are a number of open source projects that you could use including Duke or dedupe (I'm a main author dedupe).