Optimally selecting n datapoints from k

105 views Asked by At

Problem statement:

I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:

  • select minimal number of strings so that the selected strings are not different than any other string within the 32k by more than 2 characters This means that the count of strings that needs to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13 - meaning that the eventual count of strings to be selected is not astronomical.

What I tried so far:

Clustering with k++ initialization and then k-means using hamming distance - but this did not yield in the desired outcome, albeit the problem resembles a clustering problem in a sense that we are practically looking for cluster centers with cluster members within a radius of 2.

What I am thinking of is simply selecting that string which has the most other strings having a distance of 1 and then of 2 - afterwards take out all these from the 32k and then repeat the calculation until no strings are left, but this is likely to be a suboptimal solution, e.g. this way I would select more strings than what is required at minimum I believe (selecting additional strings is a cost)

Question:

What other algorithms should I consider or think of? Thanks!

3

There are 3 answers

1
gerald On

Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.

string1 = ('abcbacaacbaab')
string2 = ('abcbacaacbbbb')

from difflib import SequenceMatcher
a=string1
b=string2
x = SequenceMatcher(a=a,b=b).ratio()
print(x)
#output: 0.8462

#OR (I used pip3 install jellyfish first)   
import jellyfish
x=jellyfish.damerau_levenshtein_distance
        (a,b)  
print(x)  
#output: 2
3
joaopfg On

if I understood, you want the minimum subset such that all elements in the subset are not different by more than two characters to the elements outside of the subset (please, let me know if I misunderstood the problem).

If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).

You can precalculate the quantity of each unique character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) for initialization of the precalculation matrix and O(m * n) to actually calculate it.

Each column can vote for the removal of a string of the set if the character of the string in that column is equal to the number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let the column vote. If you get three votes, you are sure the string needs to be kicked out of the set. So there is no need to continue iterating through the columns, just go to the next string. Else, the string needs to remain, just append it to the answer.

A python code is attached:

def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
    pre_calc = [[0 for j in range(k)] for i in range(m)]
    ans = []

    for i in range(n):
        for j in range(m):
            pre_calc[j][ord(s[i][j]) - ord('a')] += 1

    for i in range(n):
        votes_cnt = 0
        remove = False

        for j in range(m):
            if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
                votes_cnt += 1

                if votes_cnt == 3:
                    remove = True
                    break

        if remove is False:
            ans.append(s[i])

    if len(ans) == 0:
        ans.append(s[0])

    return ans 
0
gerald On

You might be able to use one of the types of 'fuzzy string matching' explained at:

https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching

There's "difflib" which computes a ratio of the differences. (You're in luck, your strings are all the same length.) There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!