Reliably detecting a substring pattern in string building

112 views Asked by At

I'm trying to do a programming task but I'm failing to meet the performance requirements. Here is the full description of it:

The language model takes an n-letter word S and a parameter k (an integer with 1 ≤ k < n) as input and then generates a continuation of that word based on the rules specified below.

Assuming we already have a word S′ that is an extension of S by certain letters, adding a new letter works as follows (also see the example below): we consider the k-letter suffix R of the word S′ and look at all previous occurrences of R in the word S′ (as contiguous substrings). Then, for each letter in the alphabet, we count how many times it occurred directly after R in the word S′. Let l be the letter that occurred most frequently. Ties are resolved in favor of the letter occurring earlier in the alphabet, and if R did not occur anywhere else in the word S′, then we assume l = 'a'. Finally, we extend the word S′ by appending the letter l to its end.

For example, let S = "abaaabababa" and k = 3. We have S′ = S, R = aba, and R occurs earlier with the next letter as abaa, abab, abab. The most frequent letter is b, so we append b to S′. Now S′ = "abaaabababab", R = bab, and R occurs earlier with the next letter as baba, baba, so we append 'a' to S′.

Input: The first line of input contains four integers n, k, a, and b (2 ≤ n ≤ 10^6, 1 ≤ k < n, n < a < b < 10^18, b + 1 − a ≤ 10^6). The second line of input contains an n-letter string consisting of lowercase English alphabet characters ('a' – 'z'), representing the word S.

Output: The output should be a sequence of b + 1 − a characters, representing the letters in the extended word S′ at positions from the a-th to the b-th (inclusive). In other words, assuming bn letters have been added to the initial word S, you want to print the last b + 1 − a of those added letters.

Example: For the input data:

11 3 12 13 abaaabababa

the correct result is:

ba

I've managed to create a program that builds the string correctly but fails to do so in reasonable time when (b - n) - the number of letters we assume are added - is great. b can be as big as 10^18 and n can be only as big as 10^6, leaving 10^18 - 10^6 characters to generate. On the other hand, we only care about b - a + 1 characters, which is at most only a million and i fit in time to do it. I believe I'm thinking wrong on how to approach this problem. I'm certain we actually shouldn't generate the whole string as all of the tests i ran seem to generate a pattern that starts sooner or later. Maybe we should just generate the needed amount of characters and rotate the string (roll it)? Any kind of help is appreciated. If you believe you know a resource that could help me, I'll be glad. I'm having hard time searching for a suitable solution. The task is about an imaginary chatbot, maybe some machine learning algorithm would do the trick?

Here is my simplified code (in the original one i use a radix trie with custom hashing but this is not important I believe, both programs generate correct results).

[[nodiscard]] static std::string Model(
    const size_t n,
    const size_t k,
    const size_t a,
    const size_t b,
    std::string& word) noexcept(true)
{
    /* k-lettered substring -> character after the substring -> count */
    std::unordered_map<std::string, std::unordered_map<char, size_t>> occurences;

    for (size_t i{ 0U }; i < n - k; ++i)
        /* Extract a k-lettered substring at position i -> extract the letter after it -> increate count */
        ++occurences[word.substr(i, k)][word[i + k]];

    std::string lettersToPrint{};
    for (size_t i{ 0U }; i < (b - n) /* Do we really have to generate all of them? */; ++i)
    {
        /* k-lettered substring at the end of the "word". */
        std::string suffix(word.substr(word.length() - k, k));

        /* Detect the most popular character occurring after each suffix substring. */
        size_t mostPopularSuffixAppendLetterCount{ 0U };
        char mostPopularSuffixAppendLetter{ 'a' };

        for (char j{ 0 }; j < 26; ++j)
        {
            // careful! works with ASCII!!!
            const char currentCharacter{ char('z' - j) };
            const auto currentCharacterCount{ occurences[suffix][currentCharacter] };

            if (currentCharacterCount >= mostPopularSuffixAppendLetterCount)
            {
                mostPopularSuffixAppendLetterCount = currentCharacterCount;
                mostPopularSuffixAppendLetter = currentCharacter;
            }
        }

        word += mostPopularSuffixAppendLetter;

        if ((b - n) - i <= b + 1 - a)
            lettersToPrint += mostPopularSuffixAppendLetter;
    }

    return lettersToPrint;
}

int main()
{
    size_t n{ 65 }, k{ 2 }, a{ 67 }, b{ 783 };
    std::string word{ "ffdedffddfdefedddfdfddfddeedfdededddffeefffdeedfeddeddddfeefdffee" };

    assert(Model(n, k, a, b, word) == "dfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddfdeddf");

    return 0;
}
1

There are 1 answers

0
Alois Christen On

As this is for homeworks, I won't give you the full answer, to let you understand by yourself.

Here are some pointers :

  1. Is there a repetition that appears in the string continuation ?
  2. Can you prove it ?
  3. If there is a repetition, what it's maximum length ?
  4. Can you detect where the repetition starts without generating the whole continuation ?
  5. Can you use these informations to find the continuation in the last (b + 1 - a) characters ?