MySQL data matching: better options?

272 views Asked by At

I've got customers and leads coming from different sources and I need to figure out if customer is already registered as a lead.

I use 12 fields for matching:

address1_clear
address2_clear
address_clear
contact_name_clear
email
invoice_mobile
invoice_phone
mobile
name_clear
phone
phone2
taxnum

(_clear suffix means data is in lower case, w/o white space and punctuation).

  • leads - 300k records
  • customers - 500k records
  • customers_leads - 460k records

This is the query used to perform matching:

SELECT l.id as lead_id, c.id as customer_id FROM lead l
INNER JOIN sync_settings s ON s.account_id = l.account_id
INNER JOIN customers c ON c.setting_id = s.id
LEFT JOIN customers_leads cl ON cl.customer_id = c.id AND cl.lead_id = l.id
WHERE cl.lead_id IS NULL AND
(
    (l.phone IS NOT NULL AND l.phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.mobile IS NOT NULL AND l.mobile != "" AND l.mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_phone IS NOT NULL AND l.invoice_phone != "" AND l.invoice_phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_mobile IS NOT NULL AND l.invoice_mobile != "" AND l.invoice_mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.email IS NOT NULL AND l.email != "" AND l.email = c.email) OR
    (l.taxnum IS NOT NULL AND l.taxnum != "" AND l.taxnum = c.taxnum) OR
    (l.contact_name_clear IS NOT NULL AND l.contact_name_clear != "" AND l.contact_name_clear = c.contact_name_clear) OR
    (l.address1_clear IS NOT NULL AND l.address1_clear != "" AND l.address1_clear = c.address_clear) OR
    (l.address2_clear IS NOT NULL AND l.address2_clear != "" AND l.address2_clear = c.address_clear) OR
    (l.name_clear IS NOT NULL AND l.name_clear != "" AND l.name_clear IN (c.contact_name_clear, c.name_clear))
)

It is super heavy, response time is ~4 minutes. Indexing didn't help much because of ORs and additional conditions.

I would like to know: is there some better way to do it? Maybe using some NoSQL database to build a huge hash table basically or some data matching technique that I wasn't able to google?

P. S. I'm aware that I can make separate tables purely for matching fields and it will be faster, but still I would like to know my alternatives.

2

There are 2 answers

0
fgregg On

The problem you have is called record linkage and there is no database solution that natively addresses the problem.

There are a number of open source projects that you could use including Duke or dedupe (I'm a main author dedupe).

0
Joel Becker On

Another open source project to consider is recordlinkage (the Python Record Linkage Toolkit). The project's documentation includes an overview of the record linkage process, beginners' code examples, and API-documentation.