Similarity hash function (simhash)


I have a problem with choosing a hash function. I have to assign a number (128-bit or 64-bit) to every word in a document, such that similar words get nearby hash values. For example, if the hash value of "similarity" is 10022, then "similar" should hash to something like 10025. Words of the same kind should also get similar values: the hash of a name such as "john" should be close to that of "michel" or "sita", and so on. Does anybody have an idea how to do this?

Thanks in advance. :)


There are 2 answers

3
Ramesh Karna On BEST ANSWER

It doesn't work that way. First you have to find a general model for a sample of the available data, and then apply that model to the streaming log messages.
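For reference, a simhash-style fingerprint does not produce nearby integers for similar inputs; it aims to keep them close in Hamming distance (the number of differing bits). Below is a minimal sketch, assuming character trigrams as features and BLAKE2b as the per-feature hash. Both choices, and the helper names ngrams, simhash, and hamming, are illustrative assumptions and not something taken from the question.

```python
import hashlib


def ngrams(word, n=3):
    """Split a word into overlapping character n-grams (assumed feature set)."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def simhash(word, bits=64):
    """Simhash-style fingerprint: per-bit majority vote over feature hashes."""
    counts = [0] * bits
    for gram in ngrams(word):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each feature votes +1 if its bit i is set, otherwise -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, c in enumerate(counts):
        if c > 0:
            value |= 1 << i
    return value


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    for other in ("similarity", "john"):
        print(other, hamming(simhash("similar"), simhash(other)))
```

Words that share many trigrams tend to differ in fewer bits than unrelated words, but this is a statistical tendency and not a guarantee, especially for short words with few features. Note that it captures spelling overlap only; it says nothing about "john" and "sita" both being names.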

1
richard On

There is a library called OpenNLP that can tell you what type of word you are dealing with. Then, as you said for similar words such as names, you can write a hash function in which names, verbs, and so on each get similar hash values. Thanks.
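One way to read this idea is to put the word's category in the high bits of the hash and an ordinary hash of the word in the remaining bits, so all words of one category fall in the same numeric range. The sketch below assumes exactly that layout. TOY_TAGS is a hypothetical lookup table standing in for a real tagger such as OpenNLP (no OpenNLP API is used here), and CATEGORY_IDS and category_hash are made-up names for illustration.

```python
import hashlib

# Hypothetical category numbering; 0 is reserved for "unknown".
CATEGORY_IDS = {"name": 1, "verb": 2, "noun": 3}

# Stand-in for a real tagger's output (assumption, not OpenNLP).
TOY_TAGS = {"john": "name", "michel": "name", "sita": "name", "run": "verb"}


def category_hash(word, bits=64, category_bits=8):
    """Hash whose top `category_bits` bits encode the word's category."""
    key = word.strip().lower()
    cat_id = CATEGORY_IDS.get(TOY_TAGS.get(key, "unknown"), 0)
    low_bits = bits - category_bits
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=bits // 8).digest()
    # Keep only the low bits of the ordinary hash so the category fits on top.
    low = int.from_bytes(digest, "big") & ((1 << low_bits) - 1)
    return (cat_id << low_bits) | low


if __name__ == "__main__":
    for w in ("john", "michel", "sita", "run"):
        print(w, hex(category_hash(w)))
```

With this layout, "john", "michel", and "sita" share the same top byte, so they sit in one contiguous range of values, while their low bits remain unrelated. How close two words are within a category is therefore not meaningful.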