Is a Generalized Suffix Tree a good data structure to use for string searches on a dict of strings where partial matches should also be returned?

29 views Asked by At

I have a dictionary of strings that I would like to perform string searches on in real time (web application with approx. 1500 total users).

Background: I have a data table that follows the structure

C1 C2 C3 C4 C5 C6 C7-13
String String String String String String Bool
  • ~5000 rows (n)
  • Longest String per column: 150 characters
  • ~10000 Unique words
  • Alphabet: Printable ASCII characters
  • Each character in the alphabet is equally likely to appear in each string position.

Search Sudo Algorithm:

  1. On app startup read in the table as a dict = {C1:[0,...n], C2[0,..., n], ...}. 'Data' argument in minimal ex.
  2. User types a key (multi-word accepted) that is used to search through the dict values in columns C1-C6.
  3. Return any rows that contains similar strings to the search.

Minimal Search Example:

from fuzzywuzzy import fuzz


def similarity_sort(key, data):
    result_rows = []
    temp = []

    for index in range(len(data['column 1'])):
        check_r = perform_check(key, "{} {} {} {} {} {}".format(data['column1'][index], ['column2'][index], data['column3'][index], data['column4'][index], data['column5'][index], data['column6'][index]))
        temp.append([index, check_r])

    for x in temp:
        if x[1] > 20:
            result_rows.append(x[0])
    result_rows.sort()
    return result_rows


def perform_check(term, row_value):
    check_ratio = fuzz.token_set_ratio(term.lower(), row_value.lower())
    return check_ratio

With Search Key: 'SQL'

Expected to match with column containing values such as:

  1. SQL

  2. MySQL

  3. SQLdatabase

  4. MySQLdatabase

My question is would a Generalized Suffix Tree be a good fit for my string search problem?. If not are there other data structures I should look into?

0

There are 0 answers