Burrows-Wheeler Transform (BWT) repeating string

1.3k views Asked by At

I'm writing Burrows-Wheeler Transform and its reverse in Python. It works fine for small strings, but it fell apart when I tested a bigger string. At certain points, the string seems to loop over. I'm sure it must have to do with the final loop of the decoder, but I'm following the steps found on more than one website. My implementation is as follows:

class BurrowsWheelerTransform:

    def __init__(self, data):
        self.data = data

    def transform(self):
        # get data size
        size = len(self.data)
        # get order (by index) of rotations
        order = sorted(range(size), key=lambda i: self.data[i:])
        # get index of original rotation
        index = order.index(0)
        # return size appended with last column of (imaginary) rotation table
        return chr(255) * (index // 255) + chr(index % 255) + ''.join(self.data[(i - 1 + size) % size] for i in order)

    def restore(self):
        # get index of end of index
        eoi = next(i for i in range(len(self.data)) if ord(self.data[i]) < 255)
        # get index
        index = 255 * eoi + ord(self.data[eoi])
        # get tranformed content
        content = self.data[eoi + 1:]
        # get lshift array
        lshift = [i - 1 for symbol in sorted(set(content)) for i, x in enumerate(self.data) if x == symbol]
        # restore
        restored = ''
        for i in range(len(content)):
            index = lshift[index]
            restored += content[index]
        # return restored
        return restored

Original string:

Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so! he blushingly replied to Princess Marys expressions of gratitude for her deliverance, as she termed what had occurred. Any police officer would have done as much! If we had had only peasants to fight, we should not have let the enemy come so far, said he with a sense of shame and wishing to change the subject. I am only happy to have had the opportunity of making your acquaintance. Good-by, Princess. I wish you happiness and consolation and hope to meet you again in happier circumstances. If you dont want to make me blush, please dont thank me!

Decoded string:

Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so!Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so!Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When

Bizarrely enough, it seems that this happens for other implementations I found on the web and tested, like this one and this one. What is going on? Have I misunderstood how the transform works? Or is this implementation incorrect?

2

There are 2 answers

0
cabralpinto On BEST ANSWER

Found the answer! This algorithm is based on the assumption that the string is concatenated to one final character, which is unique and lexicographically smaller than any other character in the string. However, as I intend to utilize any value in the 0-255 range for any given character, no extra symbols are at my disposal. Luckily, thanks to John Kurlak and some extra bug fixing, I've been able to slightly modify my initial implementation to account for this fact. Here it is:

class BurrowsWheelerTransform:

    def __init__(self, data):
        self.data = data

    def transform(self):
        # get data size
        size = len(self.data)
        # get doubled string
        self.data *= 2
        # get order (by index) of rotations
        order = sorted(range(size), key=lambda i: self.data[i:])
        # get index of original rotation
        index = order.index(0)
        # return index appended with last column of (imaginary) rotation table
        return chr(255) * (index // 255) + chr(index % 255) + ''.join(self.data[(i - 1 + size) % size] for i in order)

    def restore(self):
        # get index of end of index
        eoi = next(i for i in range(len(self.data)) if ord(self.data[i]) < 255)
        # get index
        index = 255 * eoi + ord(self.data[eoi])
        # get tranformed content
        content = self.data[eoi + 1:]
        size = len(content)
        # get lshift array
        lshift = [i for symbol in sorted(set(content)) for i, x in enumerate(content) if x == symbol]
        # restore
        restored = ''
        for i in range(size):
            index = lshift[index]
            if index >= size: break
            restored += content[index]
        # return restored
        return restored

Thanks, John!

0
Angie On

I faced same problem with BWT sorting algorithm in C++, thanks to your comment i quickly fixed it by using a 16-bit array instead of 8-bit to allow value higher than 0xFF (255) for the last character (also had to modify BWT sorting algorithm code a little bit to make it accept 16-bit data). It works perfectly now, thank you.