Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here:
https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
could yield three clusters ({A}
, {B,C,D}
and {E}
), for instance, if its results were:
A -> h01
B -> h02
C -> h02
D -> h02
E -> h03
Similarly, the MinHashing technique described in the Chapter 3 of the MMDS book:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
could also yield the same three clusters if its results were:
A -> h01 - h02 - h03
B -> h04 - h05 - h06
|
C -> h04 - h07 - h08
|
D -> h09 - h10 - h08
E -> h11 - h12 - h13
(Each set corresponds to a MH signature composed of three "bands", and two sets are grouped if at least one of their signature bands is matching. More bands would mean more matching chances.)
However I have several questions related to these:
(1) Can SH be understood as a single band version of MH?
(2) Does MH necessarily imply the use of a data structure like Union-Find to build the clusters?
(3) Am I right in thinking that the clusters, in both techniques, are actually "pre-clusters", in the sense that they are just sets of "candidate pairs"?
(4) If (3) is true, does it imply that I still have to do an O(n^2)
search inside each "pre-cluster", to partition them further into "real" clusters? (which might be reasonable if I have a lot of small and fairly balanced pre-clusters, not so much otherwise)
SimHash and MinHash are both hashing algorithms that are able to map a set to a list of values which corresponds to the signature of the set.
In case of SimHash the list of values is just a list of bits (the values are either 0 or 1). In case of MinHash a value in the list represents the minimum hash value of all set elements relative to a given hash function which is typically a 32bit or 64bit value.
A major difference of both algorithms is the probability of hash collisions. In case of SimHash it is equal to the cosine similarity and in case of MinHash it is equal to the Jaccard similarity. Depending how you define the similarity between sets the one or the other algorithm could be more appropriate.
Regardless of the chosen hashing algorithm, the values of the calculated signature are equally partitioned over a certain number of bands. If the signatures of any two sets are identical within at least one band, the corresponding pair of sets is selected as candidate for similarity. (This means if n sets have the same signature within a band, there are O(n^2) candidate pairs just from this band.) Estimating the similarity of each candidate pair using the complete signature (including the values from other bands) and keeping only those pairs with an estimated similarity above a given threshold gives you all similar pairs of sets which finally define the final clustering.