Make a SimHash (locality-sensitive hashing) algorithm more accurate?


I have 'records' (basically CSV strings) of two names and one address. I need to find records that are similar to each other: basically the names and address portions all look 'alike', as if a human had judged them to match.

I used the ideas from this excellent blog post: http://knol.google.com/k/simple-simhashing# to write a simple SimHash. If two or more strings produce the same SimHash, I pass all records in that subset to a fine-grained, O(n^2) matching program that compares every record of the set to every other record.

For the SimHash portion, I have parameters where I can define the datagram size (basically a sliding window of size n over the strings) and the number of iterations, which determines how many (random) hashes to use for the SimHash calculation. So far I am using a datagram size of 4 and 4 hashes to compute the SimHash. I have tried various other combinations, but this one yields the best results so far.
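For concreteness, here is a minimal sketch (not the actual code) of the bucket-then-compare pipeline described above: group records by their SimHash value, then run the expensive pairwise matcher only inside each bucket. The candidatePairs helper, the class name, and the toy hash in main are all illustrative stand-ins.

    import java.util.*;
    import java.util.function.ToLongFunction;

    // Sketch: bucket records by SimHash, then do O(m^2) work only within
    // each bucket (m = bucket size) instead of O(n^2) over the whole set.
    public class BucketThenCompare {

        public static List<int[]> candidatePairs(List<String> records,
                                                 ToLongFunction<String> simHash) {
            // 1. Bucket together records that share a SimHash value.
            Map<Long, List<Integer>> buckets = new HashMap<>();
            for (int i = 0; i < records.size(); i++) {
                long h = simHash.applyAsLong(records.get(i));
                buckets.computeIfAbsent(h, k -> new ArrayList<>()).add(i);
            }
            // 2. Emit every pair inside each bucket; these go to the
            //    fine-grained matcher.
            List<int[]> pairs = new ArrayList<>();
            for (List<Integer> bucket : buckets.values()) {
                for (int a = 0; a < bucket.size(); a++) {
                    for (int b = a + 1; b < bucket.size(); b++) {
                        pairs.add(new int[]{bucket.get(a), bucket.get(b)});
                    }
                }
            }
            return pairs;
        }

        public static void main(String[] args) {
            // Placeholder "hash": first four characters only, purely for demonstration.
            ToLongFunction<String> toyHash =
                    s -> s.substring(0, Math.min(4, s.length())).hashCode();
            List<int[]> pairs = candidatePairs(
                    List.of("JOHNSMITH123", "JOHNSMYTH123", "ROBERTPARKER442"), toyHash);
            pairs.forEach(p -> System.out.println(Arrays.toString(p)));
        }
    }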

The issue I'm running into is that this method finds only about 80% of the duplicates in the data sets I have. I know this because I have verified the entire data set against the painfully slow O(n^2) full match mentioned above. The O(n^2) matcher is OK for data sets of fewer than 10^4 records, but quickly becomes infeasible, since I need to run sets of size 10^8.

Any ideas, suggestions or thoughts on how I can increase the accuracy of the SimHash so more of the 'similar' records are tagged with the same SimHash number?

EDIT: Prior to SimHashing, I capitalize the strings and remove all characters outside [0-9A-Z]. Examples of things that should match (spelling mistakes are intentional):


  1. JOHN SMITH, 123 ANY STREET SOMETOWN ZIP
  2. JOHNNY SMITH, 123 ANY STRET SOMETOWNE ZIP
  3. ROBERT PARKER, 442 ANY STREET SOMETOWN ZIP

Here records 1 and 2 are similar; 3 is not. The output should be: 1 + 2.
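(For reference, the cleanup described in this edit can be sketched in a few lines of Java; the Normalizer class and method name are just placeholders.)

    public class Normalizer {
        // Uppercase the record and strip every character outside [0-9A-Z],
        // as described in the edit above.
        static String normalize(String record) {
            return record.toUpperCase().replaceAll("[^0-9A-Z]", "");
        }

        public static void main(String[] args) {
            System.out.println(normalize("Johnny Smith, 123 Any Stret Sometowne Zip"));
            // prints JOHNNYSMITH123ANYSTRETSOMETOWNEZIP
        }
    }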


There are 3 answers

Rob Neuhaus (best answer)

Before you try to be fancy and change the sim hash, have you tried applying domain specific knowledge to the problem?

Do you have a list of missed pairs for your algorithm? Is there anything they have in common?

Have you tried doing things like normalizing capitalization, converting nicknames to full names, dropping middle names, expanding N, E, S, W to north, south, east, west (or the reverse), expanding "St" to "Street", etc.?
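A minimal sketch of what such domain-specific normalization could look like, assuming a token-by-token rewrite; the dictionary entries below are illustrative placeholders, not a complete nickname or abbreviation list.

    import java.util.Map;

    // Illustrative domain-specific normalizer: expand or canonicalize tokens
    // using a small lookup table before any hashing or matching is done.
    public class DomainNormalizer {

        private static final Map<String, String> REWRITES = Map.of(
                "JOHNNY", "JOHN",
                "BOB", "ROBERT",
                "N", "NORTH",
                "S", "SOUTH",
                "E", "EAST",
                "W", "WEST",
                "ST", "STREET",
                "RD", "ROAD"
        );

        public static String normalize(String record) {
            StringBuilder out = new StringBuilder();
            for (String token : record.toUpperCase().split("[^0-9A-Z]+")) {
                if (token.isEmpty()) continue;
                out.append(REWRITES.getOrDefault(token, token)).append(' ');
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(normalize("Johnny Smith, 123 Any St N"));
            // prints JOHN SMITH 123 ANY STREET NORTH
        }
    }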

Harrison F

(I'd put the below in a comment but don't have the rep yet.)

What ultimately are you trying to do? Find all duplicates? How are you defining duplicates? Does case sensitivity matter? Does similar wording count?

I am slightly confused about how you are going about this: finding similar records and creating a set, but then later doing an O(n^2) check for what I assume is exact equality. If you are checking for exact equality, that seems to defeat the purpose of finding similar records (unless you are using the similarity step as a filter for your O(n^2) pass to save time).

Some random thoughts: Run each record through a sort of sanitizer that attempts to convert each record to the most general form (if you care / this matters).

If exact equality is what you are interested in, and memory is not a restriction, but you are looking for speed, you could just create a Java object for each record. Define the .equals() for each record (you could always customize this to not do exact equality). You will then need to define a hashCode() for this object. Then you can stick each record in a HashSet.

The resulting HashSet will have no duplicates (as defined by your .equals() / .hashCode() implementation).

Or, if you want to find the duplicates, then before you add a record to the HashSet, check whether the set already contains it; if it does, you have found a duplicate.
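A minimal sketch of that idea, assuming a record of two names and one address and exact-equality semantics; the Record class and field names are illustrative.

    import java.util.HashSet;
    import java.util.Objects;
    import java.util.Set;

    // Sketch of the HashSet approach: .equals()/.hashCode() define what
    // "duplicate" means; here they use exact equality of all three fields.
    public class ExactDedup {

        static final class Record {
            final String name1, name2, address;

            Record(String name1, String name2, String address) {
                this.name1 = name1;
                this.name2 = name2;
                this.address = address;
            }

            @Override
            public boolean equals(Object o) {
                if (!(o instanceof Record)) return false;
                Record r = (Record) o;
                return name1.equals(r.name1) && name2.equals(r.name2)
                        && address.equals(r.address);
            }

            @Override
            public int hashCode() {
                return Objects.hash(name1, name2, address);
            }
        }

        public static void main(String[] args) {
            Set<Record> seen = new HashSet<>();
            Record[] input = {
                    new Record("JOHN SMITH", "JANE SMITH", "123 ANY STREET"),
                    new Record("JOHN SMITH", "JANE SMITH", "123 ANY STREET")
            };
            for (Record r : input) {
                if (seen.contains(r)) {
                    System.out.println("duplicate found: " + r.name1);
                } else {
                    seen.add(r);
                }
            }
        }
    }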

This implementation would be very fast, but could potentially use a LOT of memory, as you would be storing the entire data set in memory. An alternative would be to create a hash for each record, store that in the HashSet instead, and check the hashes for equality.

The downside of hashing each record is the challenge of developing a hash function with good distribution, and of course with hashes you have to worry about false positives from collisions. But if your hashing algorithm is solid, the chance of collision should be so rare that you shouldn't really worry about it.

Some hashes you could use: something as simple as an MD5 of the concatenation of all fields, a checksum, or the sum of the hashCode() of each field. I'm not a super math genius, so I can't tell you which would have the best distribution behavior and thus the lowest chance of collisions. Might be worth researching if you decide to go this route.
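For example, the MD5-of-concatenated-fields idea might be sketched as follows; the field separator and method name are arbitrary choices.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: hash the concatenation of all fields with MD5. A separator is
    // inserted so that ("AB","C") and ("A","BC") do not produce the same hash.
    public class RecordHash {

        public static String md5OfFields(String... fields) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            String joined = String.join("\u0001", fields);
            byte[] digest = md.digest(joined.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            System.out.println(md5OfFields("JOHN SMITH", "JANE SMITH", "123 ANY STREET"));
        }
    }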

Ben Whitmore

Simhash is not a suitable algorithm for this purpose, as it's only useful for near-duplicate detection where the differences are very minor and the vast proportion of features are identical. See my tutorial on simhash and solving the Hamming distance problem.

A better approach would be minhash, possibly with LSH. It looks as though your features for hashing would be best generated as shingles of characters (of length perhaps 4) rather than shingles of words.
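Here is a minimal sketch of what minhash with LSH banding over 4-character shingles could look like; the number of hash functions, the band/row split, and the simple seeded hash are illustrative choices, not a tuned configuration.

    import java.util.*;

    // Illustrative MinHash + LSH banding: records whose signatures agree on at
    // least one band land in the same bucket and become candidate pairs for
    // the fine-grained comparison, which now runs only within buckets.
    public class MinHashLsh {

        static final int NUM_HASHES = 32;  // illustrative
        static final int BANDS = 16;       // 16 bands x 2 rows = 32 hashes
        static final int ROWS = 2;
        static final int SHINGLE = 4;

        // Cheap seeded string hash; a real implementation would use a stronger mix.
        static int hash(String s, int seed) {
            int h = seed;
            for (int i = 0; i < s.length(); i++) h = h * 31 + s.charAt(i);
            return h ^ (h >>> 16);
        }

        // For each hash function, keep the minimum hash over all shingles.
        static int[] minhash(String normalized) {
            int[] sig = new int[NUM_HASHES];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (int i = 0; i + SHINGLE <= normalized.length(); i++) {
                String shingle = normalized.substring(i, i + SHINGLE);
                for (int k = 0; k < NUM_HASHES; k++) {
                    int h = hash(shingle, 0x9E3779B9 * (k + 1));
                    if (h < sig[k]) sig[k] = h;
                }
            }
            return sig;
        }

        public static void main(String[] args) {
            String[] records = {
                    "JOHNSMITH123ANYSTREETSOMETOWNZIP",
                    "JOHNNYSMITH123ANYSTRETSOMETOWNEZIP",
                    "ROBERTPARKER442ANYSTREETSOMETOWNZIP"
            };
            // Bucket key = band index + band contents; value = record indices.
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (int r = 0; r < records.length; r++) {
                int[] sig = minhash(records[r]);
                for (int b = 0; b < BANDS; b++) {
                    int[] band = Arrays.copyOfRange(sig, b * ROWS, (b + 1) * ROWS);
                    String key = b + ":" + Arrays.toString(band);
                    buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
                }
            }
            // Any bucket holding more than one record yields candidate pairs.
            buckets.values().stream()
                    .filter(list -> list.size() > 1)
                    .forEach(list -> System.out.println("candidates: " + list));
        }
    }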

Given such short text fields, and given that word order is unlikely to change much, you should consider including terminating shingles as well: shingles from the start and end of a text field that contain fewer than the normal number of characters, plus a terminating mark. This tends to be more lenient toward spelling differences in short runs of text (see the sketch after the examples below). E.g. "Whitmore" and "Whitemore" without terminating shingles would yield

[ WHIT, HITM, ITMO, TMOR, MORE ] and [ WHIT, HITE, ITEM, TEMO, EMOR, MORE ] with low Jaccard similarity of 2/9;

whereas with terminating shingles included these would yield

[ #W, #WH, #WHI, WHIT, HITM, ITMO, TMOR, MORE, ORE#, RE#, E# ] and [ #W, #WH, #WHI, WHIT, HITE, ITEM, TEMO, EMOR, MORE, ORE#, RE#, E# ], with a higher Jaccard similarity of 8/15.
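The following short sketch reproduces those numbers; the shingle generation (leading and trailing shingles padded with '#') and the Jaccard computation are spelled out, and the class and method names are illustrative.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of shingling with terminating shingles, as in the
    // WHITMORE / WHITEMORE example above. '#' marks field start/end.
    public class TerminatingShingles {

        static Set<String> shingles(String s, int n) {
            Set<String> out = new HashSet<>();
            // Leading terminating shingles: #W, #WH, #WHI for n = 4.
            for (int len = 1; len < n; len++) {
                out.add("#" + s.substring(0, Math.min(len, s.length())));
            }
            // Normal shingles of length n.
            for (int i = 0; i + n <= s.length(); i++) {
                out.add(s.substring(i, i + n));
            }
            // Trailing terminating shingles: ORE#, RE#, E# for n = 4.
            for (int len = n - 1; len >= 1; len--) {
                out.add(s.substring(Math.max(0, s.length() - len)) + "#");
            }
            return out;
        }

        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> a = shingles("WHITMORE", 4);
            Set<String> b = shingles("WHITEMORE", 4);
            System.out.println(jaccard(a, b)); // 8/15
        }
    }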

Rob Neuhaus's suggestions on pre-normalizing are very sensible. I would normalize long-form words down to their abbreviations (e.g. "Saint James Street" would be normalized to "ST JAMES ST"). Normalizing in the other direction may be difficult, since some abbreviations are ambiguous ("St" --> "STREET" or "SAINT"?). Also, the abbreviated forms contribute fewer shingles and thus have less influence on overall similarity, which is good: people often mistype "Road" for "Street", etc., and it doesn't change the meaning much.