SimHash function details


While researching the SimHash algorithm for checking the similarity of two documents, a few questions came up:

  • When hashing text documents, a common criterion for the feature vector representation is commonality of words. Is there any other well-functioning example of a feature vector representation?
  • When hashing text documents, every implementation I found removed common stop words. Does this mean that for every language there should be a different SimHash?
  • Does SimHash only work on text documents? Can I hash binary data and expect it to work just as well (with the right feature vector representation)?

There is 1 answer

otmar

The SimHash algorithm allows the computation of fingerprints (also called signatures) for sets of elements. These fingerprints can then be used to estimate the cosine similarity of the original sets. Thus, the SimHash algorithm is not limited to text documents. It can be used for any object that can be mapped to a set representation if the corresponding cosine similarity is a meaningful measure of object similarity.
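To make the fingerprint idea concrete, here is a minimal sketch of SimHash over a set of string elements. The 64-bit width, the use of BLAKE2b as the per-element hash, and all function names are my own assumptions for illustration, not part of any particular library. The estimate relies on the usual random-hyperplane argument that the fraction of differing bits approximates the angle between the sets divided by π, so with only 64 bits it is noisy.

```python
import hashlib
import math
from typing import Iterable

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def element_hash(element: str) -> int:
    """Map an element to a 64-bit integer (BLAKE2b is an assumption; any well-mixed hash would do)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(elements: Iterable[str]) -> int:
    """Compute a SimHash fingerprint for a set of elements (all weights equal to 1)."""
    counts = [0] * BITS
    for element in set(elements):
        h = element_hash(element)
        for i in range(BITS):
            # Each element votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # ties (count == 0) leave the bit unset
            fingerprint |= 1 << i
    return fingerprint


def estimated_cosine(fp_a: int, fp_b: int) -> float:
    """Estimate cosine similarity from the Hamming distance of two fingerprints."""
    differing_bits = bin(fp_a ^ fp_b).count("1")
    return math.cos(math.pi * differing_bits / BITS)


def exact_cosine(a: set, b: set) -> float:
    """Exact cosine similarity of two sets, for comparison with the estimate."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Comparing `estimated_cosine(simhash(a), simhash(b))` with `exact_cosine(a, b)` on your own data is a quick way to see how much accuracy a given fingerprint width buys you.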

GPS routes, for example, could be represented as a set of cells in a rasterized map. The cosine similarity between sets of cells could be a measure of the similarity of different GPS routes.
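As a rough sketch of the GPS example, the route can be rasterized by flooring each coordinate onto a grid. The grid resolution, the function names, and the sample coordinates below are hypothetical. A real implementation would probably also interpolate between consecutive points so that sparsely sampled routes do not skip cells.

```python
import math

CELL_SIZE_DEG = 0.01  # hypothetical grid resolution in degrees


def route_to_cells(points, cell_size=CELL_SIZE_DEG):
    """Map a list of (lat, lon) points to the set of grid cells they fall into."""
    return {
        (math.floor(lat / cell_size), math.floor(lon / cell_size))
        for lat, lon in points
    }


def set_cosine(a, b):
    """Cosine similarity of two sets, treating each as a 0/1 vector."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Two made-up routes that share their first two cells and then diverge.
route_a = [(48.2051, 16.3701), (48.2155, 16.3803), (48.2252, 16.3904)]
route_b = [(48.2053, 16.3704), (48.2157, 16.3806), (48.2352, 16.4004)]

similarity = set_cosine(route_to_cells(route_a), route_to_cells(route_b))
```

The cell tuples can be turned into strings (for example with `repr`) and fed to any SimHash implementation that expects hashable elements.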

A common method for mapping text documents to sets is tokenization, in which the text is decomposed into words or n-grams. Removing stop words that are likely to occur in each text document can increase the contrast of the cosine similarity.
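The tokenization step could look roughly like the sketch below. The regular expression, the tiny English stop-word list, and the function names are assumptions for illustration. The stop-word list is passed as a parameter, so a different language would only need a different list and not a different hash.

```python
import re

# A tiny illustrative English stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "on"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_set(text, stop_words=STOP_WORDS):
    """Set of words in the text with stop words removed."""
    return {token for token in tokenize(text) if token not in stop_words}


def word_ngrams(text, n=2):
    """Set of word n-grams (stop words are kept so that word order is preserved)."""
    tokens = tokenize(text)
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
```

Either `word_set` or `word_ngrams` produces a set that can be fingerprinted directly. Which one gives a more meaningful cosine similarity depends on the documents being compared.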