Problem statement:
I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:
- select minimal number of strings so that the selected strings are not different than any other string within the 32k by more than 2 characters This means that the count of strings that needs to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13 - meaning that the eventual count of strings to be selected is not astronomical.
What I tried so far:
Clustering with k++ initialization and then k-means using hamming distance - but this did not yield in the desired outcome, albeit the problem resembles a clustering problem in a sense that we are practically looking for cluster centers with cluster members within a radius of 2.
What I am thinking of is simply selecting that string which has the most other strings having a distance of 1 and then of 2 - afterwards take out all these from the 32k and then repeat the calculation until no strings are left, but this is likely to be a suboptimal solution, e.g. this way I would select more strings than what is required at minimum I believe (selecting additional strings is a cost)
Question:
What other algorithms should I consider or think of? Thanks!
Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.