Performance issue while getting all the words with common prefix in Prefix Tree

428 views Asked by At

I have a prefix tree to store a huge collection of words. Right now, if i want to find all the words with the common prefix say "a", I first retrieve the first node containing a and then exhaustively search in the Depth First fashion in the children nodes of the first node. While this idea looks naive and straightforward, it in-fact is pathetically slow if the possible number of words with common prefix is VERY HIGH(>20K). Is there some other way to retrieve all the words starting with a common prefix efficiently? Or should i adopt some other data structure? Thanking you in advanced.

EDIT1 Basically I am creating a full word by visiting every node and adding character incrementally. All the words are later stored in a vector container. And yes, i have recursive implementation.

EDIT2

vector<int> getNonEmptyEdgeIndices(Node* parent) {
    vector<int> indices;
    for(int i=0; i<EDGE; i++) {
        if (parent->edges[i] != NULL) {
            indices.push_back(i);
        }
    }
    return indices; 
}

vector<string> getSubsequentStrings(vector<string> wordsStartingWith, Node* node, string prefix) {
    vector<int> indices = getNonEmptyEdgeIndices(node);

    // push the word to the container if node is a leaf 
    if (indices.empty()) {
        wordsStartingWith.push_back(prefix);
        return wordsStartingWith;
    }

    // if frequency is set in node, push the word but still continue recursion
    if (node->frequency != 0) {
        wordsStartingWith.push_back(prefix);
    }

    // look all the children of the node
    for(unsigned int i=0; i<indices.size(); i++) {
        string newPrefix = prefix + getNodeChar(indices[i]);
        Node* child = node->edges[indices[i]];

        // recursively get the prefix for all children
        wordsStartingWith = getSubsequentStrings(wordsStartingWith, child, newPrefix);  
    }

    return wordsStartingWith;
}

vector<string> Trie::getWordsStartingWith(string prefix) {
    vector<string> wordsStartingWith;
    Node* lastNode = getLastNode(prefix);

    if (lastNode != NULL) {
        wordsStartingWith = getSubsequentStrings(wordsStartingWith, lastNode, prefix);
    }
    return wordsStartingWith;
}

EDIT 3 SOLVED!!! There was actually a problem with my implementation. I was passing this huge vector string container in recursive calls which was in-fact the problem. Thank you everyone for your kind advice.

1

There are 1 answers

0
Timothy On

Actually TRIE+DFT is already a good enough solution to your case. Its time complexity is O(M+B^M) where M is the max length of a word and B is the constant number of possible letters (usually B=26). Although it's exponential, but in practice it could be much faster than you think, because the TRIE tree is very sparse and the M is a small number.

A simpler (not guaranteed better) solution is to sort all the words into an array. Then you could get what you want by binary searching the array for the first and the last word who has the target prefix, just like how you use an English dictionary. The sorting takes O(NlogN) and the searching takes O(MlogN) where N is the number of words. It's polynomial.

If you really Need for Speed, then you can almost always pay memory space for exchange. In this case, you could link each word to all its prefix nodes by pointers during building the TRIE tree. Then the time complexity would be reduced to O(M+N) which is extremely fast. But on the other hand it will take a space complexity of O(NM). Supposing you have a million words with 5 letters on average, you would spend about 150KB memory on the pointers.