I need to compute the longest common substrings from a set of filenames in C++.
Precisely, I have an std::list of std::strings (or the QT equivalent, also fine)
char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));
I need to compute the n distinct longest common substrings of all strings, in this case e.g. for n=2
"File" and ".xls"
If I could compute the longest common subsequence, I could cut it out it and run the algorithm again to get the second longest, so essentially this boils down to:
Is there a (reference?) implementation for computing the LCS of a std::list of std::strings?
This is not a good answer but a dirty solution that I have - brute force on a QList of QUrls from which only the part after the last "/" is taken. I'd love to replace this with "proper" code.
(I have discovered http://www.icir.org/christian/libstree/ - which would help greatly, but I can't get it to compile on my machine. Someone used this maybe?)
QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
{
QString a;
int foundPosition = -1;
int foundLength = -1;
for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
{
bool hit=true;
int xj;
for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
{
if (!hit) break;
QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
//qDebug() << "SEARCH " << firstString;
for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
{
if (!hit) break;
//qDebug() << " IN " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
//qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
xj = j;
//qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
hit = false;
}
}
}
if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
{
foundPosition = i; // at the current position
foundLength = xj-1;
//qDebug() << "Found at " << i << " length " << foundLength;
}
}
a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
//qDebug() << a;
return a;
}
If as you say suffix trees are too heavyweight or otherwise impractical, the following fairly simple brute-force approach may be adequate for your application.
I assume distinct substrings shall be non-overlapping and are picked from left to right.
Even with these assumptions, there need not be a unique set that comprises "the N distinct longest common substrings" of a set of strings. Whatever N is, there might be more than N distinct common substrings all of the same maximal length and any choice of N from among them would be arbitrary. Accordingly the solution finds the at-most N *sets* of the longest distinct common substrings in which all those of the same length are one set.
The algorithm is as follows:
Q is the target quota of lengths.
Strings is the problem set of strings.
Results is an initially empty multimap that maps a length to a set of strings, Results[l] being the set with length l
N, initially 0, is the number of distinct lengths represented in Results
If Q is 0 or Strings is empty return Results
Find any shortest member of Strings; keep a copy of it S and remove it from Strings. We proceed by comparing the substrings of S with those of Strings because all the common substrings of {Strings, S} must be substrings of S.
Iteratively generate all the substrings of S, longest first, using the obvious nested loop controlled by offset and length. For each substring ss of S:
If ss is not a common substring of Strings, next.
Iterate over Results[l] for l >= the length of ss until end of Results or until ss is found to be a substring of the examined result. In the latter case, ss is not distinct from a result already in hand, so next.
ss is common substring distinct from any already in hand. Iterate over Results[l] for l < the length of ss, deleting each result that is a substring of ss, because all those are shorter than ss and not distinct from it. ss is now a common substring distinct from any already in hand and all others that remain in hand are distinct from ss.
For l = the length of ss, check whether Results[l] exists, i.e. if there are any results in hand the same length as ss. If not, call that a NewLength condition.
Check also if N == Q, i.e. we have already reached the target quota of distinct lengths. If NewLength obtains and also N == Q, call that a StickOrRaise condition.
If StickOrRaise obtains then compare the length of ss with l = the length of the shortest results in hand. If ss is shorter than l then it is too short for our quota, so next. If ss is longer than l then all the shortest results in hand are to be ousted in favour of ss, so delete Results[l] and decrement N.
Insert ss into Results keyed by its length.
If NewLength obtains, increment N.
Abandon the inner iteration over substrings of S that have the same offset of ss but are shorter, because none of them are distinct from ss.
Advance the offset in S for the outer iteration by the length of ss, to the start of the next non-overlapping substring.
Return Results.
Here is a program that implements the solution and demonstrates it with a list of strings:
Output:
(Built with gcc 4.8.1)