Searching strings where Hamming Distance is less than a threshold

1.3k views Asked by At

Currently I work on an application where I have large number of hash values (strings).
When a query hash value (string) is given, the search process goes through those strings and return strings where the Hamming Distance between the query string and the result string is less than a given threshold.

  • Hash values are not binary strings. e.g. "1000302014771944008"
  • All hash values (strings) has the same fixed length.
  • Threshold values is not small (normally t>25) and can be vary.

I want to implement this search process using an efficient algorithm rather than using brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried Locality-sensitive hashing, but implementations I found were focused on binary strings.

Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.

.

Additional Information

Hamming Distance between non-binary strings

string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
          -------------------------
               1     1        1    = 3 <-- hamming distance

Considered brute-force approach

  1. calculate Hamming Distance between first hash string and the query hash string.
  2. if Hamming Distance is less than the threshold, then add the hash string to the results list.
  3. repeat step 1 and 2 for all hash strings.
2

There are 2 answers

0
Amit Prakash On

Something like this could work for you.

http://blog.mafr.de/2011/01/06/near-duplicate-detection/

0
R. S. On

Read the 7th section of the paper:

"HmSearch: An Efficient Hamming Distance Query Processing Algorithm".

The state-of-art result for d-query problem can be found at:

"Dictionary matching and indexing with errors and don’t care", which solves d-query problem in time O(m+log(nm)^d+occ) using space O(n*log(nm)^d), where occ is the number of query result.

If threshold values is not smal, there are practical solutions for binary strings, found on HmSearch.

I think it is possible to apply the same practical solutions found on HmSearch for arbitrary strings, but I've never seen those solutions.