I've been reading the description of xz file format ( http://tukaani.org/xz/xz-file-format.txt ). But when I try to look into an xz file with binary editor, it doesn't seem to follow the structure defined in the description. What am I missing?
I compressed the description file (xz-file-format.txt) with xz cli utility in linux (xz version 4.999.9beta) and these are the first 32 bytes I get:
FD 37 7A 58 5A 00 00 04 E6 D6 B4 46 02 00 21 01 16 00 00 00 74 2F E5 A3 E0 A9 28 2A 99 5D 00 05
Overall structure of the file should be: stream - stream padding - stream - and so on. And in this case I think there should be only one stream since there is only one file compressed in the file. Structure of the stream is: stream header - block - block - ... - block - index - stream footer. And structure of the stream header is: header magic bytes - stream flags - crc code.
I can find the stream header from my file, but after the first sixteen bytes it doesn't seem to follow the description anymore.
First six bytes above are clearly the magic bytes. Next two bytes are the stream flags. Stream flags indicate that CRC64 is being used, so the CRC code takes next eight bytes. Seventeenth byte (I count from one) should then be the first byte of the first block.
Structure of a block is: block header - compressed data - block padding - check. Structure of block header should be: block header size - block flags - compressed size - uncompressed size - list of filter flags - header padding - CRC. So the seventeenth byte should then be block header size (0x16 in my file). That's possible, but the eighteenth byte seems a bit weird. It should be the block flags bit field. In my file it's null - so no flags set. Not even the number of filters, which according to description should be 1-4.
Since bits 6 and 7 of the block flags are also zeros, compressed and uncompressed sizes should not be present in the file and the next bytes should be the list of filter flags. Structure of the list is: filter ID - size of properties - filter properties. Nineteenth byte should then be filter ID. This is null in my file which is not any of officially defined filter IDs. If it would be a custom ID it would take nine bytes, but as I understand the encoding of sizes described in section 1.2 of the description it can't be, since according to the description: "All but the last byte of the multibyte representation have the highest (eighth) bit set.", but in my file the twentieth byte is also null.
So is there something I don't understand or is the file not following the description?
I asked the question a bit hastily and came up with a solution myself. Just in case someone would be interested, I answer my own question.
I had misunderstood the meaning of the stream flags in stream header. They don't affect the CRC code in the header (which is always CRC32), just CRCs in the stream itself (as the name stream flags implies). This means that the CRC in the header is only four bytes long and thus bytes 13-24 form a valid block header.
In the block header, the block flags field is again a null byte, which I saw as a problem before. According to the description, number of filters should be between 1 and 4. So I expected a decimal value of at least one. Since number of filters is expressed with two bits the maximum decimal value is 3, but number of possible values (zero included) is of course four and thus zero means one filter.
Since also the last two bits of the block flags are zeros, no compressed size or uncompressed size fields are present in the block header. This means that bytes 15-17 are the filter flags for the first (and only) filter. Filter id 0x21 is the id of LZMA2 filter. Size of properties 0x01 means size of one byte. And dictionary size 0x16 means size of 4096 KiB.