I have the following two text:
text0 = "AAAAAAAAAAAA";
text1 = "AAAAABAAAAAA";
I use 4-shingle. Thus, text0 = {AAAA}, text1 = {AAAA, AAAB, AABA, ABAA, BAAA}.
Then, the Jaccard similarity is sim = 1/5 = 0.2.
I do not want this result. Because the two text seems having high similar.
I want to use bag similarity as following:
text0 = {AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA},
text1 = {AAAA, AAAA, AAAB, AABA, ABAA, BAAA, AAAA, AAAA, AAAA}.
If use this two bags, its similar is sim = 5/9. This is much high than 0.2.
Does MinHash can do this one?
For bags you can use weighted minwise hashing, see
S. Ioffe, Improved consistent sampling, weighted minhash and l1 sketching, 2010
or
A. Shrivastava, Simple and Efficient Weighted Minwise Hashing, 2016.
If the multiplicities are always small integral numbers you could also use unweighted min-wise hashing by making entries unique, e.g. through numbering:
text0 = {AAAA1, AAAA2, AAAA3, AAAA4, AAAA5, AAAA6, AAAA7, AAAA8, AAAA9},
text1 = {AAAA1, AAAA2, AAAB1, AABA1, ABAA1, BAAA1, AAAA3, AAAA4, AAAA5}.