Optimization: Count the number of suffix pairs

1k views Asked by At

Question: Count the total number of pairs in a list of strings, so that for each pair, one string is the suffix of the other. For example, "door" is the suffix of "backdoor", "a" is the suffix of "cba".

``1st Version: Slow

def count_same_endings(words):
    """
    Count the number of word pairs with the same ending.

    Parameters:
    words (list): List of words.

    Returns:
    int: The count of word pairs with the same ending.
    """
    strings = sorted([word[::-1] for word in words])
    count = 0
    to_print = []
    i = 0
    while i < len(words):
        j = 1
        while i + j < len(words) and strings[i + j].startswith(strings[i]):
            # to_print.append((strings[i+j][::-1], strings[i][::-1]))
            count += 1
            j += 1
        i += 1
    # print(to_print)
    return count

words = ['cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ca']
result = count_same_endings(words)
print(result)
19

``2nd Version: Faster

def count_end_slices(strings):
    slice_count = {}
    strings_count = {}
    count = 0

    # Iterate over each word in the list
    for string in strings:
        length = len(string)

        # Check if this word is already the suffix of some previous words
        if string in slice_count:
            count += slice_count[string]

        # Generate slices that include the end of the string
        for start in range(length):
            slice = string[start:]

            # Count the occurrences of each slice
            if slice in slice_count:
                # Check if previous words are the suffix of this string
                if slice in strings_count:
                    count += strings_count[slice] if slice != string else 0
                slice_count[slice] += 1
            else:
                slice_count[slice] = 1

        # Account for this word
        if string in strings_count:
            strings_count[string] += 1
        else:
            strings_count[string] = 1

    return slice_count, count

``Comparison between 1st and 2nd

I have already created two algorithms which are both pretty brute force. I wonder if there is a more optimized way of solving this question.

4

There are 4 answers

5
MBo On

We can join all strings with symbol "$" (it is less than all alphabetic symbols) into string common, then build suffix array over common string, I've used implementation from here.

Then sort source strings, and look for every string+"$" in suffix array. Sorted suffixes and sorted strings allow to walk through both arrays once.

When string is larger lexically than current suffix, we move to the next suffix.

When string is equal to current suffix (more exactly, to its prefix of corresponding length), we increment counter and move to the next suffix.

When string is lexically smaller than current suffix, we move to the next string.

At the end we subtract number of strings from counter (to remove coincidences of strings with themselves).

from collections import defaultdict

def sort_bucket(s, bucket, order):
    d = defaultdict(list)
    for i in bucket:
        key = s[i + order // 2:i + order]
        d[key].append(i)
    result = []
    for k, v in sorted(d.items()):
        if len(v) > 1:
            result += sort_bucket(s, v, 2 * order)
        else:
            result.append(v[0])
    return result


def suffix_array_ManberMyers(s):
    return sort_bucket(s, range(len(s)), 1)

strs = ["back", "backdoor", "cba", "ba", "or", "door"]  #4
strs = ['cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ba', 'ca']  #25
strs = ['cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ca']      #19
strs = ['cba', 'a', 'a', 'ba'] #6
strs = ['cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ba', 'ca', 'a', 'a', 'a', 'a', 'b', 'ba', 'ba', 'ca'] #96
strs = ['dcba', 'cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ba', 'ca', 'a', 'a', 'a', 'a', 'b', 'ba', 'ba', 'ca'] #109


def cnt_suff(strs):
    common = "$".join(strs)+"$"
    print(common)
    sa = suffix_array_ManberMyers(common)
    print(sa)

    lensa = len(sa)
    lenstrs = len(strs)
    strs.sort()
    i = 0
    strcnt = 1
    cnt = 0
    while i < lenstrs-1 and strs[i]==strs[i+1]:
        i += 1
        strcnt += 1
    t = strs[i]+"$"
    cnt -= strcnt*(strcnt+1)//2
    lt = len(t)
    k = 0
    while k < lensa:
        subs = common[sa[k]:sa[k]+lt]
        if t > subs:
            k+=1
        elif t == subs:
            print(subs, strcnt)
            cnt+=strcnt
            k+=1
        else:
            i+=1
            if i==lenstrs:
                break
            strcnt = 1
            while i < lenstrs-1 and strs[i]==strs[i+1]:
                i += 1
                strcnt += 1
            cnt -= strcnt*(strcnt+1)//2
            t = strs[i] + "$"
            lt = len(t)
    return (cnt)

def dumb(strs:list[str]):
    c = 0
    for i in range(len(strs)-1):
        for j in range(i+1, len(strs)):
            if strs[i].endswith(strs[j]) or strs[j].endswith(strs[i]):
                c += 1
    return c

print(cnt_suff(strs), dumb(strs))
3
nice_dev On

Mine is a Trie version to store the strings. We store the strings in a reversed form along with a cnt for each node indicating the no. of strings that ended at this node acting as a suffix for the current string at hand.

While counting, we also maintain a set for not including duplicate pairs for duplicate strings.

Snippet:

class Node:
    def __init__(self, val):
        self.val = val
        self.kids = dict()
        self.cnt = 0
        
def countSuffixPairs(words):
    for i in range(len(words)):
        words[i] = words[i][::-1] # reverse all for suffixes
    def insertInTrie(root, s):
        temp = root
        for i in range(len(s)):
            if s[i] not in temp.kids:
                temp.kids[ s[i] ] = Node(s[i])
            temp = temp.kids[ s[i] ]
        temp.cnt += 1
    def countPairsInTrie(root, s, fullWordIncluded):
        temp = root
        cnt = 0
        for i in range(len(s)):
            if s[i] not in temp.kids:
                break
            temp = temp.kids[ s[i] ]
            if i == len(s) - 1:
               if s not in fullWordIncluded:
                   fullWordIncluded.add(s)
                   cnt += (temp.cnt * (temp.cnt - 1)) // 2
            else:
                cnt += temp.cnt                   
        return cnt
    root = Node('~')
    for word in words: 
        insertInTrie(root, word)
    ans = 0
    fullWordIncluded = set()
    for word in words: 
        ans += countPairsInTrie(root, word, fullWordIncluded)
    return ans


print(countSuffixPairs(['cba', 'a', 'a', 'a', 'a', 'b', 'ba', 'ca'])) // 19

Live Online Demo

1
Matt Timmermans On

There are a lot of ways to solve this problem in linear time, but in python, performance is often about leveraging the fast built-ins instead of executing slow python code.

Your fast implementation is pretty fast, I think. I came up with a few versions that were a little faster, but none that were a lot faster.

Of the ones that were a little faster, this one is the simplest. It runs a little faster than yours (using my test data) even though it includes an O(n log n) sort:

def fast(strings):
    #reverse all the strings to turn suffixes into prefixes
    sorted = [s[::-1] for s in strings]
    sorted.sort()
    # a prefix that matches i characters in the previous
    # string matches matchCounts[i-1] words
    matchCounts = []
    prevstr = ''
    count = 0
    for s in sorted:
        for i in range(min(len(prevstr), len(s))):
            if s[i] != prevstr[i]:
                del matchCounts[i:]
                break
            count+=matchCounts[i]
        while len(matchCounts) < len(s):
            matchCounts.append(0)
        matchCounts[len(s)-1] += 1
        prevstr = s
    return count

This is a somewhat optimized version of your algorithm that runs a little faster than both of them:

def faster(strings):
    suffixIds = {}
    counts = [0]
    wordcounts = [0]
    wordids = []
    for s in strings:
        id = 0
        for i in range(len(s)-1, -1, -1):
            key = s[i:]
            id = suffixIds.get(key)
            if id == None:
                id = len(counts)
                counts.append(1)
                wordcounts.append(0)
                suffixIds[key] = id
            else:
                counts[id] += 1
        wordids.append(id)
        wordcounts[id] += 1
    return sum([counts[id] for id in wordids]) - sum([n * (n+1) // 2 for n in wordcounts])
0
Tejas Paulzagade On

There is room for improvement, and you can optimize the solution further. Let me provide you with an optimized version based on your 2nd version, addressing some inefficiencies and improving the overall readability:

def count_end_slices(strings):
    slice_count = {}
    count = 0

    for string in strings:
        length = len(string)

        # Check if this word is already the suffix of some previous words
        if string in slice_count:
            count += slice_count[string]

        # Generate slices that include the end of the string
        for start in range(1, length):
            slice = string[start:]

            # Count the occurrences of each slice
            if slice in slice_count:
                count += slice_count[slice]

            # Update slice_count
            slice_count[slice] = slice_count.get(slice, 0) + 1

    return count

Changes and improvements:

  1. Removed the unnecessary strings_count dictionary, as it was not being used for anything useful.
  2. Removed the inner loop that iterated over the entire length of the string and started the loop from start = 1 since an empty slice is not useful in this context.
  3. Used slice_count.get(slice, 0) + 1 to handle the case where the slice is not in the dictionary yet, avoiding the need for an explicit check.

This optimized version should provide better performance and maintain the correctness of the solution.