Why is Python's BZ2 Decompressor shrinking the block?


The BZ2 file I'm using is a partial dump of Wikipedia (enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2).

Here's the Python code I wrote to test the length of a 10000-byte block before and after decompression:

import bz2

with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
    block = f.read(10000)
    print(len(block))
    block = bz2.BZ2Decompressor().decompress(block)
    print(len(block))

It outputs:

10000
2560

This suggests that the decompressor is somehow shrinking the block. How is this possible? Everything I found while searching says this shouldn't happen.

Answer by blhsing (best answer)

This is because a bzip2 file may be a concatenation of multiple compressed streams, and bz2.BZ2Decompressor decompresses only the first stream from the input data.

Excerpt from the documentation of bz2.BZ2Decompressor:

Note: This class does not transparently handle inputs containing multiple compressed streams, unlike decompress() and BZ2File. If you need to decompress a multi-stream input with BZ2Decompressor, you must use a new decompressor for each stream.
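The difference described in that note can be reproduced without the Wikipedia dump. The following is a minimal sketch that assumes two small made-up payloads (`first` and `second` are illustrative names, not part of the question) compressed separately and concatenated:

```python
import bz2

# Two independent bzip2 streams concatenated, like a multistream dump.
first = b"a" * 2560
second = b"b" * 5000
data = bz2.compress(first) + bz2.compress(second)

# The module-level function walks through every stream in the input.
all_streams = bz2.decompress(data)

# A single BZ2Decompressor stops at the end of the first stream.
decompressor = bz2.BZ2Decompressor()
first_only = decompressor.decompress(data)

print(len(all_streams))   # 7560
print(len(first_only))    # 2560
print(decompressor.eof)   # True
```

The decompressor did not shrink anything; it reported only the first stream and set eof once that stream ended.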

In your example, the first stream is 2560 bytes long after decompression. The rest of the 10000-byte buffer belongs to the following stream(s) and is stored, still compressed, in the unused_data attribute of the decompressor instance. To decompress it, instantiate a new bz2.BZ2Decompressor, as noted in the documentation.
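As a rough illustration of that hand-off, here is a sketch on two tiny in-memory streams (the payloads and the names `d1`, `d2` and `leftover` are made up for this example):

```python
import bz2

data = bz2.compress(b"first stream") + bz2.compress(b"second stream")

d1 = bz2.BZ2Decompressor()
out1 = d1.decompress(data)

# d1 has reached the end of its stream; everything after it is kept,
# still compressed, in unused_data.
leftover = d1.unused_data

# A fresh decompressor picks up where the first one stopped.
d2 = bz2.BZ2Decompressor()
out2 = d2.decompress(leftover)

print(out1)  # b'first stream'
print(out2)  # b'second stream'
```

Feeding more data to d1 after it has reached eof would raise EOFError, which is why a new instance is needed for each stream.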

You can therefore implement code that decompresses an entire bzip2 file in 10000-byte chunks by iteratively reading from either unused_data of the current decompressor instance or the next chunk of the file:

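import bz2

decompressed = []
with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
    decompressor = bz2.BZ2Decompressor()
    # Prefer bytes left over from the previous stream; otherwise read the next chunk.
    while chunk := decompressor.unused_data or f.read(10000):
        # The current stream has ended, so start a fresh decompressor for the next one.
        if decompressor.eof:
            decompressor = bz2.BZ2Decompressor()
        decompressed.append(decompressor.decompress(chunk))

# Total number of decompressed bytes across all streams.
print(sum(map(len, decompressed)))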

This outputs the total size of the uncompressed data of the given sample bzip2 file:

1018211968

The full decompressed content can then be obtained with:

b''.join(decompressed)
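If the roughly 1 GB of output does not need to be held in memory at once, the file-object interface mentioned in the documentation note may be simpler, since BZ2File moves from one stream to the next on its own. The sketch below assumes a hypothetical helper named `iter_decompressed` and demonstrates it on a small made-up two-stream file rather than the Wikipedia dump. Note that here the chunk size limits the decompressed bytes returned per read, not the compressed bytes read from disk.

```python
import bz2
import os
import tempfile

def iter_decompressed(path, chunk_size=10000):
    """Yield decompressed chunks of at most chunk_size bytes (hypothetical helper)."""
    # bz2.open returns a BZ2File, which handles multi-stream input transparently.
    with bz2.open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk

# Demo on a small made-up two-stream file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.bz2')
    with open(path, 'wb') as f:
        f.write(bz2.compress(b'x' * 30000) + bz2.compress(b'y' * 12345))
    total = sum(len(chunk) for chunk in iter_decompressed(path))

print(total)  # 42345
```

Each chunk can be processed or written out as it arrives, so memory use stays bounded by the chunk size rather than by the size of the dump.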