Burrows-Wheeler Transform (BWT) - Stored Data

400 views Asked by At

After using BWT, which set of data do we need in the encoded data? Do we need to encode (or export) the Suffix Array?

Input:

stackoverflow

BWT Output:

wtavrcfkle$soo

Suffix Array:

13, 2, 3, 7, 9, 4, 10, 5, 11, 8, 0, 1, 6, 12

5

There are 5 answers

0
richselian On BEST ANSWER

suffix array is only needed to compute bwt transform, after transform done it can be dropped away.

BWT("stackoverflow")="wtavrcfkle$soo"

UNBWT("wtavrcfkle$soo")="stackoverflow"

You can also restore the suffix array from transformed output if you like:)

0
flanglet On

To be clear, the suffix array and the BWT output are the same thing. If you look at the suffix array in your example, it contains the indexes of the letters in the BWT output taken from the BWT input (starting with 1): 13 -> w, 2 -> t, 3 -> a, etc... Using a suffix array is just a mechanism to calculate the output of the BWT in linear time. Transmitting the suffix array or the BWT output means transmitting the same information.

0
rob mayoff On

All you need to invert the transform is the output string (wtavrcfkle$soo in your example).

0
Peter de Rivaz On

You only need to transmit the BWT output.

The surprising thing about this transform is that the original string can be reconstructed from just the permuted output string.

The wikipedia article contains example code for doing this inverse.

Note that the normal mode of operation is to use run length coding to encode the BWT output before transmission (or you have not achieved any compression).

The nice thing about the transform is that it tends to produce long runs of similar characters (if there is structure in the source material) and so the run length coding works well.

0
comingstorm On

To reverse the BWT, you only need the index of the original last character, not the entire suffix array. If you don't have this index, I believe choosing an arbitrary index will result in a rotated version of your original string.

Note that, if you include an end-of-line code (as in your example), the original last character is obvious, so the index doesn't need to be provided separately...