Understanding this LZSS-based decompression algorithm

I want to understand this LZSS-based algorithm in order to write a compressor (and maybe a better decompressor). I am studying LZ77 and LZSS, but there are a few lines that I still don't get.

I have commented the following code as much as I can; the first comment block is not mine and explains the header.

#include <stdio.h>   /* needed for the printf in the debug path */

#ifndef _DEBUG_
#define _DEBUG_ 0    /* assumed to be defined by the original build */
#endif

//
// ULZ header (20 bytes)
// ---------------------
// 0x0000 - 4 bytes: signature "ULZ" + 0x1A
// 0x0004 - 2 bytes: 0
// 0x0006 - 1 byte : mode? (2/normal LZ, 0/unknown LZ type)
// 0x0007 - 1 byte : nbits for LZ algorithm
// 0x0008 - 4 bytes: offset to decompressed data (aligned to 32 bits)
// 0x000C - 4 bytes: offset to compressed data (aligned to 32 bits)
// 0x0010 - 4 bytes: size of decompressed file
//
// ULZ body (aligned to 32 bits)
// -----------------------------
// 0x0014 - X bytes: flags data
// ?????? - X bytes: decompressed data (pointer in 0x0008)
// ?????? - X bytes: compressed data (pointer in 0x000C)
//
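
//
// The same header as a C struct, which I added for my own
// reference (assuming little-endian fields and no padding):
//
//   struct ulz_header {
//       char     magic[4];     // "ULZ" + 0x1A
//       uint16_t zero;         // always 0
//       uint8_t  mode;         // 2 = normal LZ
//       uint8_t  nbits;        // bit width used by the LZ tokens
//       uint32_t data_offset;  // -> decompressed data (0x0008)
//       uint32_t comp_offset;  // -> compressed data (0x000C)
//       uint32_t out_size;     // size of the decompressed file
//   };
//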

// in:
//   source - ULZ buffer (a valid ULZ file with header)
//   target - data buffer for the decompressed data
// out:
//   0 - decompression ok
//   1 - decompression failed
int UnULZ(unsigned char *source, unsigned char *target) {
  unsigned char  *ulz_f, *ulz_d, *ulz_c, *tgt, *end;
  unsigned int    len, pos, nbits, msk;
  unsigned char   flags, mask;

  if (source[0x06] != 0x02) {
    if (_DEBUG_) printf("Mode %02X not supported\n", source[0x06]);
    return (1);
  }

  //the top comment block only says "nbits for LZ algorithm";
  //this value is used later for shifts, but how is it
  //calculated/chosen in the first place?
  nbits = source[0x07];
  //why shift 0xFFFF right by (16 - nbits)? is nbits supposed to be at most 16?
  msk = 0xFFFF >> (16 - nbits);
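  //(for example, if nbits were 11, msk would be
  //0xFFFF >> 5 = 0x07FF, a mask keeping the low 11 bits;
  //nbits = 11 is only my example value, the real one
  //comes from the header byte above)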

  //the flags data starts right after the 20-byte header
  ulz_f = source + 0x14;

  //where the stored decompressed data starts (pointer in 0x08)
  ulz_d = source + *(unsigned int *)(source + 0x08);
  //where compressed data starts
  ulz_c = source + *(unsigned int *)(source + 0x0C);

  //target buffer
  tgt   = target;
  //the format tells us the final size
  end   = target + *(unsigned int *)(source + 0x10);
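
  //so, if I read the layout right, there are three streams:
  //  ulz_f -> flag bits (1 = literal byte, 0 = LZ token)
  //  ulz_d -> literal bytes, one consumed per 1-flag
  //  ulz_c -> 2-byte LZ tokens, one consumed per 0-flag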

  for (mask = 0; tgt < end; mask <<= 1) {

    /*
    Now we check every flag bit: if the bit is 1,
    we have an uncompressed byte to copy into the
    target buffer.
    */

    if (!mask) {
        mask = 0x01;
        /*
        load the next flag byte once all 8 bits of
        the previous one have been consumed (mask
        cycles 0x01, 0x02, ..., 0x80, then wraps to 0)
        */
        flags = *ulz_f++;
    }
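    //(e.g. flags = 0x05 = 00000101b is consumed LSB first:
    //literal, token, literal, then five more tokens)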

    if (flags & mask) {
        //this bit tells us to copy one uncompressed
        //byte into the target buffer
        *tgt++ = *ulz_d++;
    } else {

        /*
        Here starts the part that I don't understand.
        I know we have to find out which already-written
        data to copy and how long the "run" is.

        So we read 2 bytes that somehow encode both the
        run length (called len here) and the offset to
        copy from.
        */

        pos = *ulz_c++;
        pos |= *ulz_c++ << 8;

        //why do we have to shift by nbits and add 2?
        //how is this supposed to help the compression?
        len = (pos >> nbits) + 2;

        //why this bitwise AND, and why + 1?
        pos = (pos & msk) + 1;
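        //worked example of the two lines above, with
        //nbits = 11 picked by me purely as an illustration:
        //bytes 0x34 0x12 form the little-endian word 0x1234, so
        //  len = (0x1234 >> 11) + 2 = 2 + 2 = 4
        //  pos = (0x1234 & 0x07FF) + 1 = 0x234 + 1 = 0x235
        //i.e. copy 4 bytes starting 0x235 bytes back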

        //copy len bytes from pos bytes back in the data
        //we have already written to the target buffer
        //(the original did *tgt++ = *(tgt - pos), which reads
        //and modifies tgt in one unsequenced expression, i.e.
        //undefined behavior in C, so I split it in two)
        while (len--) {
            *tgt = *(tgt - pos);
            tgt++;
        }
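        //(note: the copy is byte-by-byte on purpose, so pos can
        //be smaller than len and recent bytes repeat, RLE-style)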
    }
  }

  return (0);
}

To recap: why does the original author use the nbits value in this way?

msk = 0xFFFF >> (16 - nbits);

Why use those constants and shift by nbits?

pos = *ulz_c++;
pos |= *ulz_c++ << 8;

len = (pos >> nbits) + 2;
pos = (pos & msk) + 1;
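
If I read the bit layout correctly, each 2-byte token is a little-endian 16-bit word whose top 16 - nbits bits hold the run length and whose low nbits bits hold the backward distance. To check that, I wrote this small standalone test (nbits = 11 and all the names are my own example choices, not part of the format):

#include <stdio.h>

// decode one 2-byte token the way UnULZ does, for a given nbits
static void decode_token(unsigned char lo, unsigned char hi,
                         unsigned int nbits) {
    unsigned int msk  = 0xFFFF >> (16 - nbits);
    unsigned int word = lo | (hi << 8);        // little-endian 16-bit word
    unsigned int len  = (word >> nbits) + 2;   // top 16 - nbits bits, plus 2
    unsigned int pos  = (word & msk) + 1;      // low nbits bits, plus 1
    printf("word=0x%04X -> len=%u, pos=%u\n", word, len, pos);
}

int main(void) {
    decode_token(0x34, 0x12, 11);  // prints: word=0x1234 -> len=4, pos=565
    return 0;
}

Even if that is what the code does, I still don't see why the format adds 2 to the length and 1 to the distance, so any explanation is welcome.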

Thanks
