How to calculate similarity of two texts with Jaccard similarity of two bag via MinHash?

586 views Asked by At

I have the following two text:

text0 = "AAAAAAAAAAAA";

text1 = "AAAAABAAAAAA";

I use 4-shingle. Thus, text0 = {AAAA}, text1 = {AAAA, AAAB, AABA, ABAA, BAAA}.

Then, the Jaccard similarity is sim = 1/5 = 0.2.


I do not want this result. Because the two text seems having high similar.

I want to use bag similarity as following:

text0 = {AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA},

text1 = {AAAA, AAAA, AAAB, AABA, ABAA, BAAA, AAAA, AAAA, AAAA}.

If use this two bags, its similar is sim = 5/9. This is much high than 0.2.

Does MinHash can do this one?

2

There are 2 answers

4
otmar On

For bags you can use weighted minwise hashing, see

S. Ioffe, Improved consistent sampling, weighted minhash and l1 sketching, 2010

or

A. Shrivastava, Simple and Efficient Weighted Minwise Hashing, 2016.

If the multiplicities are always small integral numbers you could also use unweighted min-wise hashing by making entries unique, e.g. through numbering:

text0 = {AAAA1, AAAA2, AAAA3, AAAA4, AAAA5, AAAA6, AAAA7, AAAA8, AAAA9},

text1 = {AAAA1, AAAA2, AAAB1, AABA1, ABAA1, BAAA1, AAAA3, AAAA4, AAAA5}.

0
Ben Whitmore On

Another simple solution to improve your similarity score with very short texts is to also generate shorter shingles at beginning and end of document, using a special character to indicate beginning/end.

In this case, your shingles generated from text0 are: {@A, @AA, @AAA, AAAA, AAA@, AA@, A@}

and those from text1 are: {@A, @AA, @AAA, AAAA, AAAB, AABA, ABAA, BAAA, AAA@, AA@, A@}.

Jaccard similarity is now 7/11 = 0.64

This really comes down to a philosophical question about what "similarity" means to you: which features do you or don't you consider important to include?