Trying to store Reverse string pattern into key value pair not working (Burrows Wheeler Rotation)

97 views Asked by At

enter image description here

So I am trying to achieve this index (int) and data (string) into a Dictionary class which takes an index and data of above mentioned types. Here is my code:

for (int i = 0; i < size; i++){
     dict[i].setIndex(i);
     for (int j = i; j <= size; j++){
          dict[i].data += input[j];
     }
}

This code works fine for small text files of <10KiB but when I input large files of texts the loop seems to run forever. It keeps running until the whole memory is used and then crashed the IDE. What am I doing wrong here/or is there a way to optimize this process?

Edit: Here size variable refers to input.length().

1

There are 1 answers

3
Roman Svistunov On

Because you are trying to sort the suffices, you should use the suffix array, which is designed to solve this problem efficiently. Instead of storing the suffices themselves it stores the indices where the suffix starts. Whenever you would try to store all the suffices themselves you would use O(n^2), making such code impossible to run on larger inputs.

It ordered these indices as follows. Instead of sorting suffixes themselves, it sorts the cyclic rotations of the string. Let's extend the word's substring meaning to work with cyclic strings, allowing us to use starting positions after the ending ones. Notice that any substring of size 2k can be expressed as a concatenation of two substrings of size k. Therefore, under the assumption that we have already sorted all substrings of size k, we can sort substrings of size 2k by comparing a pair with just two comparisons, for each half of the substring. Therefore, doubling the processed substring length can be done in O(n log n) time if using comparison-based sorts, or, in this case, even in O(n) using, for example, counting sort. Sorting substrings of length one is trivial.

Therefore, the final algorithm would be this: sort all the substrings of size one. Then until you have sorted long enough strings, double the size of the sorted substrings. This doubling can repeat only O(log n) times, meaning that the whole algorithm runs in O(n log n) time and uses O(n) space. You would end up with an array of indices of starting positions of suffices, ordered by the suffix they represent. With such representation, you can easily get the last character of the rotated string (ans[(suffixIndex + n - 1) % n]), or any other part of the string.

This page has more details about this algorithm and provides an implementation in C++ language