Issue while concatenating 2 docx files


I am trying to create a single docx file that is the concatenation of two docx files. I have the following Python code:

from docx import Document

files = ['Doc2.docx', 'Doc3.docx']

def combine_word_documents(files):
    combined_document = Document('empty.docx')
    count, number_of_files = 0, len(files)
    for file in files:
        sub_doc = Document(file)

        # Don't add a page break if you've
        # reached the last file.
        if count < number_of_files - 1:
            sub_doc.add_page_break()

        for element in sub_doc._document_part.body._element:
            combined_document._document_part.body._element.append(element)
        count += 1

    combined_document.save('both_docx_files.docx')

combine_word_documents(files)

The issues are:

  1. In the resulting both_docx_files.docx file, the contents of the two docx files overlap one another instead of appearing on separate pages.
  2. Images are lost.

Any help or advice is appreciated.

I tried the Python code given above. I expected the docx files to be concatenated one after the other in the new docx file.


There are 2 answers

0
OldBoy On

You are adding the page break before copying the file's elements rather than after:

if count < number_of_files - 1:
    sub_doc.add_page_break()

for element in sub_doc._document_part.body._element:
    combined_document._document_part.body._element.append(element)
count += 1

Move the test and page break after the for loop as follows:

for element in sub_doc._document_part.body._element:
    combined_document._document_part.body._element.append(element)
if count < number_of_files - 1:
    sub_doc.add_page_break()

count += 1

You need to provide more information in relation to the second question, as it is not clear what you mean.

[edit]
I have installed the python-docx module and tried to reproduce your problem. It seems that the package cannot copy pictures from one document to another, and as far as I can tell from the documentation there is no obvious way of identifying an element as a picture.

I also get an exception when running the supplied code, on the lines that reference _document_part.body. I was able to correct it by replacing _document_part.body with simply _body. So maybe the version I downloaded (v1.1.0) is different from the one you are using.

[/edit]

1
Towsif Ahamed Labib On

The first problem is related to page breaks and is already covered in the previous answer; it is less crucial than the second problem.

Images are not directly stored in the document's XML body; they are stored separately in the document's media folder and referenced within the XML. Simply appending XML elements won't copy the images or their references correctly, leading to their absence in the combined document.

For images, the situation is complex because you need to extract images from the source documents and insert them into the combined document at the correct positions. Handling images properly requires manually copying the image files from the source documents' media directories into the combined document's media directory and fixing up the references in the document's XML. python-docx does not provide a built-in way to do this. This process can involve directly manipulating the underlying OPC package parts, which can get quite technical.
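
To make the point about separately stored images concrete, here is a small standard-library sketch that lists the image relationships of a docx file. It assumes the usual layout in which the main part is word/document.xml and its relationships live in word/_rels/document.xml.rels; the function name list_image_relationships is just for illustration. The returned ids are the values the body XML refers to, which is why copying the XML alone leaves dangling references.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the relationships file and the relationship type used for images.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def list_image_relationships(docx_file):
    """Return {relationship id: target} for images referenced by the main document part."""
    # A .docx file is a zip archive; read the relationships part directly.
    with zipfile.ZipFile(docx_file) as package:
        rels_xml = package.read("word/_rels/document.xml.rels")

    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }
```

Running it on each source file shows which media files and relationship ids would have to be carried over to the combined package.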

See this question for a better understanding: Lost formatting and image after search and replace using python-docx

Possible workaround for images: the oodocx repo at https://github.com/etfre/oodocx may work in your case.

from oodocx import Docx
doc1 = Docx('path_to_first_document.docx')
doc2 = Docx('path_to_second_document.docx')
doc1.merge(doc2, page_break=True)
doc1.save('path_to_concatenated_document.docx')

A quick solution might be to convert the documents to PDFs, merge the PDFs, and then, if absolutely necessary, convert back to DOCX. There are libraries and tools available for PDF manipulation that might handle such tasks more gracefully.