String similarity in c

2.8k views Asked by At

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3. Calculate the sum of similarities of a string S with each of its suffixes

Here is my solution...

#include<stdio.h>
#include<string.h>
int getSim(char str[],int subindex)
{
    int l2=subindex
    int i=0;
    int count=0;
    for(i=0;i<l2;i++)
        if(str[i]==str[subindex])
        {
            count++;
            subindex++;
        }
        else
            break;
    return count;   
}
int main()
{
    int testcase=0;
    int len=0;
    int sum=0;
    int i=0;
    char s[100000];
    scanf("%d",&testcase);
    while(testcase--)
    {
        sum=0;
        scanf("%s",s);
        for(i=0;i<strlen(s);i++)
            if(s[i]==s[0])
            {
                sum=sum+getSim(s,i);
            }
        printf("%d\n",sum);
    }
}

How can we go about solving this problem using suffix array??

2

There are 2 answers

12
corsiKa On

You mention in the comments that it is correct, but it's very slow.

In Java, you can get the length of a String with s.length() - this value is cached in the object and it is O(1) to get.

But when you go to C, you get the length of a string with strlen(s) which recalculates (in O(n)) each time. So while you should be doing O(n), because you have an O(n) operation in there, the entire function becomes O(n^2).

To get around this, cache the value once when you run it. This will bring you back into linear time.

Bad:

scanf("%s",s);
for(i=0;i<strlen(s);i++)
    if(s[i]==s[0])
    {
        sum=sum+getSim(s,i);
    }

Good:

scanf("%s",s);
strlen = strlen(s); /* assume you declared "int strlen" earlier */
for(i=0;i<strlen;i++) /* this is now constant time to run */
    if(s[i]==s[0])
    {
        sum=sum+getSim(s,i);
    }
0
Alexander Putilin On

I'm not sure if it is the best algorithm, but here is the solution.

First of all, build suffix array. The naive algorithm(putting all suffixes into array and then sorting it) is quite slow - O(n^2 * log(n)) operations, there are several algorithms to do this in O(nlogn) time.

I'm assuming that strings are 0-indexed.

  1. Now, take the first letter l in the string s, and use one binary search to find the index i of the first string in the suffix array which starts with l, and another binary search to find the index j of the first string in range [i..n], which doesn't start with l. Then you'll have that all strings in the range [i..j-1] starts with the same letter l. So the answer to the problem is at least j-i.

  2. Then apply similar procedure to the strings in range [i..j). Take the second letter l2, find indexes i2 and j2 corresponding to the first string s[i2] such that s[i2][1] == l2 and the first string s[j2] such that s[j2][1] != l2. Add j2-i2 to the answer.

  3. Repeat this procedure n times, until you run out of letters in the original string. The answer to the problem is j1-i1 + j2-i2 + ... + jn-in