Longest common substring from more than two strings - C++

2.5k views Asked by At

I need to compute the longest common substrings from a set of filenames in C++.

Precisely, I have an std::list of std::strings (or the QT equivalent, also fine)

char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));

I need to compute the n distinct longest common substrings of all strings, in this case e.g. for n=2

 "File" and ".xls"

If I could compute the longest common subsequence, I could cut it out it and run the algorithm again to get the second longest, so essentially this boils down to:

Is there a (reference?) implementation for computing the LCS of a std::list of std::strings?


This is not a good answer but a dirty solution that I have - brute force on a QList of QUrls from which only the part after the last "/" is taken. I'd love to replace this with "proper" code.

(I have discovered http://www.icir.org/christian/libstree/ - which would help greatly, but I can't get it to compile on my machine. Someone used this maybe?)

QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
    {
    QString a;

    int foundPosition = -1;
    int foundLength = -1;
    for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
    {
        bool hit=true;
        int xj;
        for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
        {
            if (!hit) break;

            QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
            //qDebug() << "SEARCH " << firstString;

            for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
            {
                if (!hit) break;

                //qDebug() << " IN  " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
                //qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
                if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
                    xj = j;
                    //qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
                    hit = false;
                }
            }

        }
        if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
        if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
        {
            foundPosition = i; // at the current position
            foundLength = xj-1;
            //qDebug() << "Found at " << i << " length " << foundLength;
        }
    }

    a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
    //qDebug() << a;
    return a;
}
1

There are 1 answers

0
Mike Kinghan On

If as you say suffix trees are too heavyweight or otherwise impractical, the following fairly simple brute-force approach may be adequate for your application.

I assume distinct substrings shall be non-overlapping and are picked from left to right.

Even with these assumptions, there need not be a unique set that comprises "the N distinct longest common substrings" of a set of strings. Whatever N is, there might be more than N distinct common substrings all of the same maximal length and any choice of N from among them would be arbitrary. Accordingly the solution finds the at-most N *sets* of the longest distinct common substrings in which all those of the same length are one set.

The algorithm is as follows:

  • Q is the target quota of lengths.

  • Strings is the problem set of strings.

  • Results is an initially empty multimap that maps a length to a set of strings, Results[l] being the set with length l

  • N, initially 0, is the number of distinct lengths represented in Results

  • If Q is 0 or Strings is empty return Results

  • Find any shortest member of Strings; keep a copy of it S and remove it from Strings. We proceed by comparing the substrings of S with those of Strings because all the common substrings of {Strings, S} must be substrings of S.

  • Iteratively generate all the substrings of S, longest first, using the obvious nested loop controlled by offset and length. For each substring ss of S:

    • If ss is not a common substring of Strings, next.

    • Iterate over Results[l] for l >= the length of ss until end of Results or until ss is found to be a substring of the examined result. In the latter case, ss is not distinct from a result already in hand, so next.

    • ss is common substring distinct from any already in hand. Iterate over Results[l] for l < the length of ss, deleting each result that is a substring of ss, because all those are shorter than ss and not distinct from it. ss is now a common substring distinct from any already in hand and all others that remain in hand are distinct from ss.

    • For l = the length of ss, check whether Results[l] exists, i.e. if there are any results in hand the same length as ss. If not, call that a NewLength condition.

    • Check also if N == Q, i.e. we have already reached the target quota of distinct lengths. If NewLength obtains and also N == Q, call that a StickOrRaise condition.

    • If StickOrRaise obtains then compare the length of ss with l = the length of the shortest results in hand. If ss is shorter than l then it is too short for our quota, so next. If ss is longer than l then all the shortest results in hand are to be ousted in favour of ss, so delete Results[l] and decrement N.

    • Insert ss into Results keyed by its length.

    • If NewLength obtains, increment N.

    • Abandon the inner iteration over substrings of S that have the same offset of ss but are shorter, because none of them are distinct from ss.

    • Advance the offset in S for the outer iteration by the length of ss, to the start of the next non-overlapping substring.

  • Return Results.

Here is a program that implements the solution and demonstrates it with a list of strings:

#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>

using namespace std;

// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
    auto where = strings.end();
    size_t min_len = size_t(-1);
    for (auto i = strings.begin(); i != strings.end(); ++i) {
        if (i->size() < min_len) {
            where = i;
            min_len = i->size();
        }
    }
    return where;
}

// Say whether a string is a common substring of a list of strings
bool 
is_common_substring_of(
    string const & candidate, list<string> const & strings)
{
    for (string const & s : strings) {
        if (s.find(candidate) == string::npos) {
            return false;
        }
    }
    return true;
}


/* Get a multimap whose keys are the at-most `quota` greatest 
    lengths of common substrings of the list of strings `strings`, each key 
    multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string> 
n_longest_common_substring_sets(list<string> & strings, unsigned quota)
{
    size_t nlengths = 0;
    multimap<size_t,string> results;
    if (quota == 0) {
        return results;
    }
    auto shortest_i = shortest_of(strings);
    if (shortest_i == strings.end()) {
        return results;
    }
    string shortest = *shortest_i;
    strings.erase(shortest_i);
    for ( size_t start = 0; start < shortest.size();) {
        size_t skip = 1;
        for (size_t len = shortest.size(); len > 0; --len) {
            string subs = shortest.substr(start,len);
            if (!is_common_substring_of(subs,strings)) {
                continue;
            }
            auto i = results.lower_bound(subs.size());
            for (   ;i != results.end() && 
                    i->second.find(subs) == string::npos; ++i) {}
            if (i != results.end()) {
                continue;
            }
            for (i = results.begin(); 
                    i != results.end() && i->first < subs.size(); ) {
                if (subs.find(i->second) != string::npos) {
                    i = results.erase(i);
                } else {
                    ++i;
                }
            }
            auto hint = results.lower_bound(subs.size());
            bool new_len = hint == results.end() || hint->first != subs.size();
            if (new_len && nlengths == quota) {
                size_t min_len = results.begin()->first;
                if (min_len > subs.size()) {
                    continue;
                }
                results.erase(min_len);
                --nlengths;
            }
            nlengths += new_len;
            results.emplace_hint(hint,subs.size(),subs);
            len = 1;
            skip = subs.size();
        }
        start += skip;
    }
    return results; 
}

// Testing ...

int main()
{
    list<string> strings{
        "OfBitWordFirstFileWordZ.xls", 
        "SecondZWordBitWordOfFileBlue.xls", 
        "ThirdFileZBitWordWhiteOfWord.xls", 
        "WordFourthWordFileBitGreenZOf.xls"};

    auto results = n_longest_common_substring_sets(strings,4);
    for (auto const & val : results) {
        cout << "length: " << val.first 
        << ", substring: " << val.second << endl;
    }
    return 0;
}

Output:

length: 1, substring: Z
length: 2, substring: Of
length: 3, substring: Bit
length: 4, substring: .xls
length: 4, substring: File
length: 4, substring: Word

(Built with gcc 4.8.1)