Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents: first compute their minhash signatures (from their shingles, or n-grams), then use an LSH-based algorithm to cluster them efficiently (i.e., avoid the quadratic complexity that a naive exhaustive pairwise search would entail).
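For concreteness, here is a minimal Python sketch of the banding step as I understand it (the band/row split, helper names, and data layout are mine, purely for illustration; the minhash signatures are assumed to be computed already):

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(signatures, bands=20, rows=5):
        """signatures: dict mapping doc id -> list of bands*rows minhash values."""
        buckets = defaultdict(set)
        for doc_id, sig in signatures.items():
            for b in range(bands):
                # Documents whose signatures agree on an entire band fall
                # into the same bucket and become candidate pairs.
                buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
        pairs = set()
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
        return pairs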

What I'm trying to do is bridge three different algorithms, all of which I think are related to this minhash + LSH framework, but in non-obvious ways:

(1) The algorithm sketched in Section 3.4.3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing

(2) Ryan Moulton's Simple Simhashing article

(3) Charikar's so-called SimHash algorithm, described in this article

I find this confusing because, although (2) uses the term "simhashing", I believe it is actually doing minhashing in a way similar to (1), but with the crucial difference that a cluster can only be represented by a single signature (even though multiple hash functions might be involved), whereas with (1) two documents have more chances of being found similar, because their signatures can collide in multiple "bands". (3) seems like a different beast altogether: signatures are compared in terms of their Hamming distance, and the LSH technique involves sorting the signatures multiple times instead of banding them. I have also seen (somewhere else) that this last technique can incorporate a notion of weighting, which can be used to put more emphasis on certain parts of a document, and which seems to be lacking in (1) and (2).

So at last, my two questions:

(a) Is there a (satisfying) way in which to bridge those three algorithms?

(b) Is there a way to import this notion of weighting from (3) into (1)?

1 Answer

Ben Whitmore

Article 2 is actually discussing minhash but has erroneously called it simhash; that's probably why it has since been deleted (it's archived here). Also, confusingly, it talks about "concatenating" multiple minhashes, which, as you rightly observe, reduces the chance of finding two similar documents. It seems to prescribe an approach that yields only a single "band", which will give very poor results.

Can the algorithms be bridged/combined?

Probably not. To see why, you should understand the properties of the different hashes and how they are used to avoid n² comparisons between documents.

Properties of minhash and simhash:

Essentially, minhash generates multiple hashes per document, and when two documents are similar it is likely that a subset of those hashes will be identical. Simhash generates a single hash per document, and when two documents are similar it is likely that their simhashes will be close (i.e. have a small Hamming distance).
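For concreteness, here is a minimal Python sketch of both properties (the MD5-based hash family, the shingle size, and the test sentences are illustrative choices of mine, not anything the algorithms prescribe):

    import hashlib

    def shingles(text, n=3):
        """Word n-grams of a document."""
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def h(data, seed):
        """Deterministic 64-bit hash with a seed (a stand-in for a hash family)."""
        digest = hashlib.md5(f"{seed}:{data}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def minhash(features, k=100):
        """k hashes per document: for each seed, the minimum hash over all features."""
        return [min(h(f, seed) for f in features) for seed in range(k)]

    def simhash(features, bits=64):
        """One hash per document: each bit is the sign of a tally over all features."""
        counts = [0] * bits
        for f in features:
            v = h(f, 0)
            for i in range(bits):
                counts[i] += 1 if (v >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    a = shingles("the quick brown fox jumps over the lazy dog today")
    b = shingles("the quick brown fox jumped over the lazy dog today")

    # Similar documents: many identical minhashes, small Hamming distance.
    ma, mb = minhash(a), minhash(b)
    print("matching minhashes:", sum(x == y for x, y in zip(ma, mb)) / len(ma))
    print("simhash Hamming distance:", bin(simhash(a) ^ simhash(b)).count("1"))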

How they solve the n² problem:

With minhash you index every hash to the documents that contain it; so if you store 100 hashes per document, then for each new incoming document you must look up each of its 100 hashes in your index and find all documents that share at least (e.g.) 50% of them. This can mean building very large temporary tallies, since hundreds of thousands of documents could share at least one hash.
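A minimal sketch of that inverted-index lookup (the data structures and names are mine, just to make the tallying concrete; hashes are compared position by position, since each slot corresponds to one hash function):

    from collections import defaultdict

    index = defaultdict(set)  # (slot, minhash value) -> ids of documents holding it

    def add_document(doc_id, signature):
        for slot, value in enumerate(signature):
            index[(slot, value)].add(doc_id)

    def find_similar(signature, threshold=0.5):
        """Tally shared hashes per candidate, keep those sharing >= threshold."""
        tally = defaultdict(int)
        for slot, value in enumerate(signature):
            for doc_id in index[(slot, value)]:
                tally[doc_id] += 1
        return {d: c for d, c in tally.items() if c >= threshold * len(signature)}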

With simhash there is a clever technique of storing each document's hash in multiple permutations in multiple sorted tables, such that any similar hash up to a certain Hamming distance (3, 4, 5, possibly as high as 6 or 7, depending on hash size and table structure) is guaranteed to be found nearby in at least one of those tables, differing only in its low-order bits. This makes searching for similar documents efficient, but restricts you to finding only very similar documents. See this simhash tutorial.
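As a rough illustration of the idea (deliberately simplified: I use bit rotations where real implementations typically permute fixed blocks of bits, and the rotation counts, prefix width, and distance bound here are arbitrary):

    import bisect

    BITS = 64
    ROTATIONS = [0, 16, 32, 48]  # one sorted table per rotation (illustrative)
    tables = {r: [] for r in ROTATIONS}

    def rotate(x, r):
        return ((x << r) | (x >> (BITS - r))) & ((1 << BITS) - 1)

    def add(doc_id, sh):
        for r in ROTATIONS:
            bisect.insort(tables[r], (rotate(sh, r), doc_id))

    def query(sh, prefix_bits=16, max_distance=3):
        """Scan entries sharing the top prefix_bits in some rotated table."""
        hits = set()
        low_mask = (1 << (BITS - prefix_bits)) - 1
        for r in ROTATIONS:
            key = rotate(sh, r)
            table = tables[r]
            # Entries sharing the high-order prefix sit contiguously here.
            i = bisect.bisect_left(table, (key & ~low_mask,))
            while i < len(table) and table[i][0] >> (BITS - prefix_bits) == key >> (BITS - prefix_bits):
                cand, doc_id = table[i]
                if bin(key ^ cand).count("1") <= max_distance:
                    hits.add(doc_id)
                i += 1
        return hits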

Because minhash and simhash are used so differently, I don't see a way to bridge or combine them. You could theoretically have a minhash that generates 1-bit hashes and concatenate them into something like a simhash, but I don't see any benefit in doing so.

Can weighting be used in minhash?

Yes. Think of the minhashes as slots: if you store 100 minhashes per document, that's 100 slots you can fill. These slots don't all have to be filled from the same selection of document features. You could devote a certain number of slots to hashes calculated from specific features only (taking care, though, to select features in such a way that similar documents will yield the same features).

So, for example, if you were dealing with journal articles, you could assign some slots to minhashes of the document title, some to minhashes of the article abstract, and the remaining slots to minhashes of the document body. You can even set aside a few individual slots for direct hashes of things such as the author's surname, without involving the minhash algorithm at all.
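Reusing the shingles/minhash/h helpers from the sketch above, the slot allocation might look like this (field names and slot counts are purely illustrative):

    def weighted_signature(title, abstract, body, surname):
        """100-slot signature; each text field assumed to have at least 3 words."""
        sig = []
        sig += minhash(shingles(title), k=20)     # 20 slots: title
        sig += minhash(shingles(abstract), k=30)  # 30 slots: abstract
        sig += minhash(shingles(body), k=49)      # 49 slots: body
        sig.append(h(surname, 0))                 # 1 slot: direct hash of the surname
        return sig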

It's not quite as elegant as the way simhash handles weighting, where in effect you're just creating a bitwise weighted average of all the individual feature hashes and then rounding each bit up to 1 or down to 0. But it's quite workable.
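For comparison, that weighted-average view of simhash, again using the illustrative h helper from above (the weights are arbitrary):

    def weighted_simhash(weighted_features, bits=64):
        """Each feature votes on every bit with its weight;
        each output bit is the sign of the tally."""
        counts = [0.0] * bits
        for feature, weight in weighted_features:
            v = h(feature, 0)
            for i in range(bits):
                counts[i] += weight if (v >> i) & 1 else -weight
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    # e.g. weight title shingles three times as heavily as body shingles:
    # features = [(s, 3.0) for s in shingles(title_text)] \
    #          + [(s, 1.0) for s in shingles(body_text)]
    # print(hex(weighted_simhash(features)))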