How to get content from .docx in a binary level?

1.5k views Asked by At

I had an issue with opening a .docx file and i wondered "what about getting the data straight from the bin?".

My first attempt was to just check the bin data in a .docx, the doc only contains a "Hello World" in Calibri 11.

The next are parts of the content in binary.

First part of the bin

Second part of the bin

From what i could tell, it was compressed or encrypted. And the fourth part of the bin seemed to be the largest.

After researching for a while, I found someone saying ".docx are basically .zip with .xml files inside" So i tried turning the .docx into a .zip and extracting files and worked just fine. But i was like "ok, cool, how do i get that compressed data straight from the .docx file compressed?". So i've been researching on how to manually "decompress" .zip files. Found out the method of compression, zip seems to be able to handle multiple methods, the most common is deflate which uses LZ77 compressor so i gotta get to understand this algorithm in order to decompress the .docx file data but not its content. Not a big progress yet on the matter, but i'm still making my way through.

My reasoning to solve this problem is:

  1. Find file section of interest
  2. Decompress the section of interest
  3. Get text

From what i get on how the zip compression works, maybe the decompression should be made in the whole file.

TLDR: I want to manually decompress the data in a docx, my objective is not to get the data but to understand the compression process and the structure of a docx, so i can get it from this: Compressed part of what i think is the Text (Compressed part of what i think is the Text in the .docx)

To this: Uncompressed part of the Text (Uncompressed part of the Text from the extracted document.xml)

Is there an already existing method for doing what i'm willing to do? Is it too crazy based on how compression works?

2

There are 2 answers

2
KimblerS On

Correct, .docx files can be extracted just like compressed .zip files can. You'd just need to change the file's extension to become .zip instead of .docx.
...However, you may need to enable the ability to see file name extensions first, otherwise, just adding '.zip' to the end of the file name would just be changing the name of the file and not it's extension.

Once the file name / extension is changed from something like "MyFile.docx" to "MyFile.zip", the file's icon should also change to indicate it as a compressed file giving you the options to decompress / extract like any other compressed .zip file.

After the file is extracted, Word document text should be stored in a subfolder called 'word' in a file called 'document.xml'. The 'document.xml' file can be opened using Notepad. Here you'll find the document's text along with information used by MS Word to format it within MS Word.

...Of course, if the file is password protected, this information will be encrypted and not very useful.

0
kjhughes On

DOCX represents word processing data not as binary but as XML documents (WordprocessingML, part of Office Open XML, OOXML) stored in a compressed archive format (Open Packaging Conventions, based on the ZIP file format).

From what i get on how the zip compression works, maybe the decompression should be made in the whole file.

Right, rather than starting with finding the section of interest, you'd start by decompressing the DOCX file and then finding the section and text of interest, likely within the main document.xml file.

While maximum control can be obtained by operating at the XML level, WordprocessingML is a complex format that's more easily approached with the help of a library such as the Open XML SDK.

See also: