I am currently working on an application that stores a large number of hash values (strings).
Given a query hash value (string), the search goes through those strings and returns every string whose Hamming distance from the query string is less than a given threshold.
- Hash values are not binary strings, e.g. "1000302014771944008".
- All hash values (strings) have the same fixed length.
- The threshold is not small (normally t > 25) and can vary.
I want to implement this search with an efficient algorithm rather than a brute-force approach.
I have read some research papers (like this & this), but they deal with binary strings or low threshold values. I also tried locality-sensitive hashing, but the implementations I found focused on binary strings.
Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.
Additional Information
Hamming Distance between non-binary strings
string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
          -------------------------
               1     1        1     = 3 <-- Hamming distance
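The character-by-character comparison above could be written as a short Python function. This is only a minimal sketch: the name `hamming_distance` is illustrative and not part of the original application, and it assumes both strings have the same length, as stated in the question.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # Each mismatching pair of characters contributes 1 to the sum.
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# The two example strings from this post differ in three positions.
print(hamming_distance("0014479902266110001131133",
                       "0014409902226110001111133"))  # prints 3
```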
Considered brute-force approach
1. Calculate the Hamming distance between the first hash string and the query hash string.
2. If the Hamming distance is less than the threshold, add the hash string to the results list.
3. Repeat steps 1 and 2 for every remaining hash string.
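For reference, the brute-force steps above might look roughly like this in Python. The function names `within_threshold` and `brute_force_search` are hypothetical, and the early exit once the mismatch count reaches the threshold is a small addition of this sketch, not something described in the steps. It still scans every stored hash, so it is the baseline to beat rather than a solution.

```python
from typing import Iterable, List


def within_threshold(a: str, b: str, threshold: int) -> bool:
    """Return True if the Hamming distance between a and b is < threshold.

    Assumes a and b have the same length. Stops comparing as soon as the
    number of mismatches reaches the threshold.
    """
    if threshold <= 0:
        return False  # no distance can be less than zero
    mismatches = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            mismatches += 1
            if mismatches >= threshold:
                return False
    return True


def brute_force_search(query: str, hashes: Iterable[str], threshold: int) -> List[str]:
    """Return every stored hash whose distance to the query is < threshold."""
    return [h for h in hashes if within_threshold(query, h, threshold)]


# Made-up sample data for illustration only.
stored = ["1000302014771944008", "1000302014771944009", "9999999999999999999"]
print(brute_force_search("1000302014771944008", stored, 2))
# prints ['1000302014771944008', '1000302014771944009']
```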
Something like this could work for you.
http://blog.mafr.de/2011/01/06/near-duplicate-detection/