The BZ2 file I'm using is a partial dump of Wikipedia [here].
Here's some Python code I wrote to check the length of a 10000-byte block before and after decompression:
    import bz2

    with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
        block = f.read(10000)

    print(len(block))
    block = bz2.BZ2Decompressor().decompress(block)
    print(len(block))
It outputs:
    10000
    2560
This seems to indicate that the decompressor is somehow shrinking the block. How is that possible? Everywhere I've searched says this shouldn't happen.
This is because a bzip2 file may be a concatenation of multiple compressed streams, and bz2.BZ2Decompressor decompresses only the first stream from the input data. Excerpt from the documentation of bz2.BZ2Decompressor:

    This class does not transparently handle inputs containing multiple
    compressed streams, unlike decompress() and BZ2File. If you need to
    decompress a multi-stream input with BZ2Decompressor, you must use a
    new decompressor for each stream.
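For illustration, here is a small self-contained demo of that behavior (the tiny payload is synthetic, not taken from your file):

    import bz2

    # Two independently compressed streams concatenated together form a
    # valid bzip2 input.
    payload = bz2.compress(b'hello ') + bz2.compress(b'world')

    # The one-shot bz2.decompress() handles all streams transparently...
    print(bz2.decompress(payload))   # b'hello world'

    # ...but BZ2Decompressor stops after the first stream.
    d = bz2.BZ2Decompressor()
    print(d.decompress(payload))     # b'hello '
    print(d.eof)                     # True
    print(len(d.unused_data))        # the second stream's compressed bytes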
In your example, the first stream is 2560 bytes long after decompression. The second stream begins in what's left of the buffer after the first stream, and those bytes are stored in the unused_data attribute of the decompressor instance; you can decompress them by instantiating a new bz2.BZ2Decompressor instance, as noted in the documentation.
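A minimal sketch of that step, using the 10000-byte block from your code (variable names are mine):

    import bz2

    with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
        block = f.read(10000)

    first = bz2.BZ2Decompressor()
    print(len(first.decompress(block)))   # 2560 -- output of the first stream only
    print(first.eof)                      # True -- the first stream ended inside the block

    # A fresh decompressor picks up at the start of the second stream.
    second = bz2.BZ2Decompressor()
    print(len(second.decompress(first.unused_data)))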
You can therefore implement code that decompresses an entire bzip2 file in 10000-byte chunks by iteratively reading either the unused_data of the current decompressor instance or the next chunk of the file. Such a loop outputs the total size of the uncompressed data of the given sample bzip2 file.
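A sketch of such a loop (the helper name decompress_bz2 is my own; the structure follows the description above):

    import bz2

    def decompress_bz2(path, chunk_size=10000):
        """Decompress a possibly multi-stream bzip2 file in chunk_size-byte reads."""
        chunks = []
        with open(path, 'rb') as f:
            decompressor = bz2.BZ2Decompressor()
            while True:
                if decompressor.eof:
                    # The current stream has ended: hand its leftover bytes to a
                    # fresh decompressor before reading any more of the file.
                    data = decompressor.unused_data
                    decompressor = bz2.BZ2Decompressor()
                else:
                    data = f.read(chunk_size)
                    if not data:
                        break  # end of file
                chunks.append(decompressor.decompress(data))
        return b''.join(chunks)

    result = decompress_bz2('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2')
    print(len(result))  # total size of the uncompressed data

Checking eof before reading any more of the file handles streams that end in the middle of a chunk, since unused_data already holds the start of the next stream.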
And joining the decompressed chunks gives the actual content of the entire decompressed data.