How to use multiple threads for zlib compression (same input source)

My goal is to compress data from a single source in parallel threads. I have defined a list of jobs, and each job holds a block of the data that was read (500 KB to 1 MB per job).

My compressor threads will compress each block's data using ZLIB and store it in the outbuf of the corresponding jobs.

Now I want to merge all of this and create one output file in the standard ZLIB format.

From the ZLIB RFC and after browsing the source of pigz, I understand that

A ZLIB stream is laid out as follows:

     +---+---+
     |CMF|FLG| (2 bytes)
     +---+---+
     +---+---+---+---+
     |     DICTID    | (4 bytes. Present only when FLG.FDICT is set)
     +---+---+---+---+
     +=====================+
     |...compressed data...| (variable size of data)
     +=====================+
     +---+---+---+---+
     |     ADLER32   |  (4 bytes, Adler-32 checksum of the uncompressed data)
     +---+---+---+---+
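
To make the layout concrete, here is a minimal Python sketch (Python's standard zlib module wraps the same library) that splits a complete zlib stream into its header fields, the raw deflate data, and the trailer. The helper name parse_zlib_stream is hypothetical, and the sketch assumes the whole stream is already in memory.

```python
import zlib

def parse_zlib_stream(stream: bytes) -> dict:
    """Split a complete zlib stream into header fields, deflate data, and trailer."""
    cmf, flg = stream[0], stream[1]
    # RFC 1950: CMF*256 + FLG must be a multiple of 31.
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("bad zlib header check")
    fdict = bool(flg & 0x20)            # FLG.FDICT is bit 5
    body_start = 6 if fdict else 2      # skip the 4-byte DICTID if present
    return {
        "method": cmf & 0x0F,           # 8 means deflate
        "window_bits": (cmf >> 4) + 8,  # CINFO + 8
        "fdict": fdict,
        "deflate_data": stream[body_start:-4],
        "adler32": int.from_bytes(stream[-4:], "big"),
    }

info = parse_zlib_stream(zlib.compress(b"hello world"))
print(info["method"], info["window_bits"], info["fdict"])
```

The deflate_data field is what a raw inflate (window bits of -15) would accept, and adler32 is the checksum of the uncompressed input, stored big-endian.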

In my case, there is no dictionary.

So when I combine the compressed units, the header of every unit is the same.

Hence, I am doing the following operations.

  1. For the first unit, I am writing the header + compressed data.

  2. For the second unit to the last unit, I am writing only the compressed data (No header and no trailer)

  3. After all the units are done, I use adler32_combine() to convert the checksums of all the jobs' output data into one final Adler-32, and then I append it to the end of the output file.
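
Step 3 depends on the fact that the Adler-32 of a concatenation can be derived from the checksums of the two parts plus the length of the second part. Python's zlib module does not expose adler32_combine(), so the function below is a hypothetical pure-Python equivalent derived from the checksum's definition. It is a sketch, not zlib's own code.

```python
import zlib

BASE = 65521  # largest prime below 2**16, the Adler-32 modulus

def adler32_combine(adler1: int, adler2: int, len2: int) -> int:
    """Adler-32 of A+B, given adler32(A), adler32(B), and len(B)."""
    a1, a2 = adler1 & 0xFFFF, (adler1 >> 16) & 0xFFFF
    b1, b2 = adler2 & 0xFFFF, (adler2 >> 16) & 0xFFFF
    # Low half: both sums start at 1, so one of the two 1s is dropped.
    s1 = (a1 + b1 - 1) % BASE
    # High half: every byte of B sees the extra (a1 - 1) carried over from A.
    s2 = (a2 + b2 + len2 * (a1 - 1)) % BASE
    return (s2 << 16) | s1

a, b = b"first job data", b"second job data"
combined = adler32_combine(zlib.adler32(a), zlib.adler32(b), len(b))
print(combined == zlib.adler32(a + b))
```

Note that the checksums being combined must be those of the uncompressed data in each job, and len2 is the uncompressed length of the second part.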

But the problem is that I get an error during decompression (inflate) saying the data is invalid at some places.

Has anyone already tried something like this? Any relevant information will be really helpful.

There is 1 answer

Answer by Mark Adler (best answer)

You cannot simply concatenate raw deflate data streams. Each deflate stream is self-terminating, and so decompression would end at the end of the first stream.
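
A small Python sketch of that failure mode, using the standard zlib module with negative window bits for raw deflate. The helper name raw_deflate is hypothetical; it produces a normally terminated raw stream.

```python
import zlib

def raw_deflate(data: bytes) -> bytes:
    """Compress to a complete, terminated raw deflate stream (no zlib header or trailer)."""
    co = zlib.compressobj(6, zlib.DEFLATED, -15)
    return co.compress(data) + co.flush()  # flush() defaults to Z_FINISH

first = raw_deflate(b"first block ")
second = raw_deflate(b"second block")

d = zlib.decompressobj(-15)
out = d.decompress(first + second)
print(out)                       # only the data from the first stream
print(d.eof)                     # the decompressor saw an end-of-stream marker
print(d.unused_data == second)   # the second stream was never decoded
```

The decompressor stops at the final-block marker of the first stream and leaves the rest as unused input. In a zlib container, the bytes after that point would be read as the Adler-32 trailer, which is one way to end up with a data error.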

You need to look more carefully at the pigz code for how to merge deflate streams. You can use Z_SYNC_FLUSH to complete the last block and bring it to a byte boundary without ending the deflate stream. Then you can complete the deflate stream and strip off the final empty block, which is the one marked as the end block. Do this for every deflate stream except the last, which should terminate normally. Then you can concatenate the series of n-1 unterminated deflate streams followed by the one terminating deflate stream.
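
As a rough illustration of this approach (not pigz's actual code), here is a minimal Python sketch using the standard zlib module. The names deflate_chunk and parallel_zlib are hypothetical. The sketch assumes raw deflate per chunk (window bits of -15), compression level 6, and no preset dictionary. Instead of finishing each stream and stripping the final empty block, it simply stops after Z_SYNC_FLUSH for every chunk but the last, which leaves the same unterminated bytes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def deflate_chunk(chunk: bytes, is_last: bool) -> bytes:
    """Raw-deflate one chunk; only the last chunk terminates the stream."""
    co = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    # Z_SYNC_FLUSH ends on a byte boundary without a final-block marker.
    mode = zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH
    return co.compress(chunk) + co.flush(mode)

def parallel_zlib(chunks: list[bytes], workers: int = 4) -> bytes:
    """Build one zlib stream from independently compressed chunks."""
    if not chunks:
        chunks = [b""]  # still emit a valid, empty stream
    last = len(chunks) - 1
    flags = [i == last for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(deflate_chunk, chunks, flags))  # keeps input order
    # Running Adler-32 over the uncompressed data; a C version could compute
    # one checksum per job and merge them with adler32_combine() instead.
    adler = 1
    for chunk in chunks:
        adler = zlib.adler32(chunk, adler)
    header = b"\x78\x9c"  # CMF/FLG for deflate, 32K window, default level, no FDICT
    return header + b"".join(parts) + adler.to_bytes(4, "big")

data = bytes(range(256)) * 4096
chunks = [data[i:i + 300_000] for i in range(0, len(data), 300_000)]
print(zlib.decompress(parallel_zlib(chunks)) == data)
```

Because each chunk is compressed with a fresh compressor, no chunk refers back to data in an earlier one, so the concatenation decodes as a single deflate stream. The cost is slightly worse compression at chunk boundaries.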