Find Min Length Substring Containing All Given Strings

423 views Asked by At

Given a large document and a short pattern consisting of a few words (eg. W1 W2 W3), find the shortest string that has all the words in any order (for eg. W2 foo bar dog W1 cat W3 -- is a valid pattern)

I structured the "large document" as a list of strings. I believe my solution is O(nlog(n)), but I'm not sure (I'm also not sure whether it's correct). Is there a faster way? Please note that the below is pseudocoded Java, so obviously will not compile, but I believe the message is clear:

main(){
    List<String> wordsToCheckFor;
    List<String> allWords;
    int allWordsLength = allWords.length;
    int minStringLength = POS_INFINITY;
    List<String> minString;

    //The idea here is to divide and conquer the string; I will first
    //check the entire string, then the entire string minus the first
    //word, then the entire string minus the first two words, and so on...

    for(int x = 0; x < allWordsLength; x++){
        if(checkString(allWords, wordsToCheckFor) && (allWords.length < minStringLength)){
            minString = allWords;
            minStringLength = allWords.length();
        }   
        allWords.remove(0);
    }

    System.out.println(minString);          
}


checkString(List<String> allWords, List<String> wordsToCheckFor){
    boolean good = true;
    foreach(String word : wordsToCheckFor){
        if(!allWords.contains(word))
            good = false;
    }
    return good;
}
1

There are 1 answers

0
kraskevich On BEST ANSWER

Your solution has O(n ^ 2) time complexity(in the worst case, each suffix is checked, and each check is O(n), because List.contains method has linear time complexity). Moreover, it is not correct: the answer is not always a suffix, it can be any substring.

A more efficient solution: iterate over your text word by word and keep track of the last occurrence of each word from the pattern(using, for example, a hash table). Update the answer after each iteration(a candidate substring is the one from the minimum last occurence among all words in the pattern to the current position). This solution has linear time complexity(under assumption that the number of words in the pattern is a constant).