I had an issue with opening a .docx file and i wondered "what about getting the data straight from the bin?".
My first attempt was to just check the bin data in a .docx, the doc only contains a "Hello World" in Calibri 11.
The next are parts of the content in binary.
From what i could tell, it was compressed or encrypted. And the fourth part of the bin seemed to be the largest.
After researching for a while, I found someone saying ".docx are basically .zip with .xml files inside" So i tried turning the .docx into a .zip and extracting files and worked just fine. But i was like "ok, cool, how do i get that compressed data straight from the .docx file compressed?". So i've been researching on how to manually "decompress" .zip files. Found out the method of compression, zip seems to be able to handle multiple methods, the most common is deflate which uses LZ77 compressor so i gotta get to understand this algorithm in order to decompress the .docx file data but not its content. Not a big progress yet on the matter, but i'm still making my way through.
My reasoning to solve this problem is:
- Find file section of interest
- Decompress the section of interest
- Get text
From what i get on how the zip compression works, maybe the decompression should be made in the whole file.
TLDR: I want to manually decompress the data in a docx, my objective is not to get the data but to understand the compression process and the structure of a docx, so i can get it from this: (Compressed part of what i think is the Text in the .docx)
To this: (Uncompressed part of the Text from the extracted document.xml)
Is there an already existing method for doing what i'm willing to do? Is it too crazy based on how compression works?
Correct, .docx files can be extracted just like compressed .zip files can. You'd just need to change the file's extension to become .zip instead of .docx.
...However, you may need to enable the ability to see file name extensions first, otherwise, just adding '.zip' to the end of the file name would just be changing the name of the file and not it's extension.
Once the file name / extension is changed from something like "MyFile.docx" to "MyFile.zip", the file's icon should also change to indicate it as a compressed file giving you the options to decompress / extract like any other compressed .zip file.
After the file is extracted, Word document text should be stored in a subfolder called 'word' in a file called 'document.xml'. The 'document.xml' file can be opened using Notepad. Here you'll find the document's text along with information used by MS Word to format it within MS Word.
...Of course, if the file is password protected, this information will be encrypted and not very useful.