I need to decompress a data model file embedded in xlsx file. The file is supposed to use the MS-XLDM file format and should consist of 3 sections (Spreadsheet Data Model Header, Files and Virtual Directory) and only the middle one is compressed. The first and last section are xml with unicode/utf-16 encoding presumably (every other byte is 0x00 and the content is preceded by 0xFF and 0xFE ). The middle file is preceded by a small chunk of xml. More detail about the file structure.
Now according to the documentation the file should be compressed using Xpress compression specified here which uses LZ77 compression and DIRECT2 encoding.
Now to get to the point. From my understanding, there should always be a 4 byte bitmask which indicates if byte in corresponding position should be a 1:1 data or metadata.
For example, given a hypothetical 8-bit bitmask, the string "ABCABCDEF" is compressed as (0,0)A(0,0)B(0,0)C(3,3)D(0,0)E(0,0)F. Its bitmask would be b'00010001' (0x11).
If given position is supposed to be metadata, at least 2 bytes should be read. Out of the 16 bits the first 13 is offset and the last 3 are the length (unless the last bit is 1, then another byte must be read).
So now onto the concrete example that I struggle with. The first 2 chunks are easy.
First one is:
....<Load xmlns="http://schemas.micr
The first 4 bytes (the dots) are 0x00 thus the 32 bytes that follow are uncompressed. Next chunk is similar:
....osoft.com/analysisservices/2003/
Now the 3rd chunk is where I get lost
w±engine":ddl27/"/2G_W?%g100gO8eðg_‡_)§è.Õ®]›‡o
I'm not sure where does the chunk exactly end because when I started counting every 36 bytes after those first ones after a while I would reach a portion of the byte stream which should be uncompressed and it didn't line up.
So back to the 3rd chunk. The bitmask for this one is 0x77 0xB1 0x04 0x01.
Or in binary 01110111 10110001 00000100 00000001. I tried to line it up with the bytes and it didn't make any sense. Clearly the word engine" is uncompressed and it fits to the previous chunks because a quick google search revealed to me a result with namespace "http://schemas.microsoft.com/analysisservices/2003/engine".
01110111 10110001 00000100 00000001
engine" :ddl27 /"/2G_W ?%g100gO8eðg_‡_)
This made me think that maybe the bytes if the bitmask are in reverse order. This made more sense to me.
00000001
engine"
If this was true, then the metadata should be 0x0B 0x02.
Or in binary 00001011 00000010. So if I split it up, the first 13 bits make up the offset of the metadata. And the length is 010 + constant offset 3 = 2+3=5.
Before 0000101100000
Invert 1111010011111
Decimal -353
But looking 353 bytes back it lands in the uncompressed partition xml section and should return the characters in parentheses (a.m.e). This doesn't make sense to me and is probably wrong.
Here is the file I tried to decompress.