I have a weird requirement where I need to find common "Customers" from two different and very large lists. Each entry in both lists is a Customer object that contains the first and last names of the customer and his address (broken down by address lines, like address_line1, address_line2 etc). The issue is that there is a possibility that the data in either list could be incomplete, for example, for one of the records in the first list, the customer's first name could be missing whereas in the second list, for the same customer, his address (line 2 and line 3) could be missing. What I need to do is find customers present in both lists. A point to note is that the lists can be large. Another point to remember is that the names and addresses could semantically be the same but may not return a result when you do an exact String match. For example, in the first list, the address of a customer in the first list could be of the form B-502 ABC Street whereas the address of the same customer in the second list could be in the form B 502 ABC Street. The reason I am using edit distance is to account for user input errors in the list and to handle certain other minor differences in the data present in both lists

What I did was to implement the eq function in the Customer class as follows

import re
import editdistance # Using this: https://pypi.python.org/pypi/editdistance

class Customer:
    def __init__(self, fname, lname, address1, address2, address3, city):
        # Removing special characters from all arguments and converting them to lower case
        self.fname = re.sub("[^a-zA-Z0-9]", "", fname.lower())
        self.lname = re.sub("[^a-zA-Z0-9]", "", lname.lower())
        self.address1 = re.sub("[^a-zA-Z0-9]", "", address1.lower())
        self.address2 = re.sub("[^a-zA-Z0-9]", "", address2.lower())
        self.address3 = re.sub("[^a-zA-Z0-9]", "", address3.lower())
        self.city = re.sub("[^a-zA-Z0-9]", "", city.lower())

    def __eq__(self, other):
        if self.lname == "" or self.lname != other.lname:
            return False

        t = 0

        if self.fname != "" and other.fname != "" and self.fname[0] == other.fname[0]:
            t += 1

        if editdistance.eval(self.fname, other.fname) <= 2:
            t += 3

        if editdistance.eval(self.address1, other.address1) <= 3:
            t += 1

        if editdistance.eval(self.address2, other.address2) <= 3:
            t += 1

        if editdistance.eval(self.address3, other.address3) <= 3:
            t += 1

        if editdistance.eval(self.city, other.city) <= 2:
            t += 1

        if t >= 4:
            return True

        return False

    def __hash__():
        # TODO:  Have a robust implementation of a hash function here. If two objects are "equal", their hashes should be the same

In order to get the customers present in both lists, I would be doing the following:

set(first_list).intersection(set(second_list))

But in order for this to work, the Customer object needs to be hashable.

Could someone help me out with a good hashing mechanism?

1

There are 1 answers

0
gmanjon On

Your only option is normalize data. If you need to compare equality and you may hace different formats the solution is normalization. Transform everything so it will be in the same format in both lists.

I've worked for months in a normalization algorithm for addresses in Spain. The combination of different user inputs for the same address is endless (I was working on a 7 million rows database). Using that distance function may not be accurate enough unless you know exactly the different possible formats for a same address and the distance returned form the function for those differences.

The first key question would be, what's the percentage of error you can afford? Because with user input and big data you will always have some.

Next step would be to measure the percentage of error you will get with that distance algorithm (or any other algorithm). Choose the sample data carefully, so that percent won't vary with full data.

If that percentage suits you use that algorithm, if not, find other algorithms and measure them.