Extract image position from .docx file using python-docx


I'm trying to get the image index from a .docx file using the python-docx library. I can extract the image name, height, and width, but not the index of where the image sits in the Word file.

import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
    print(s.height.cm, s.width.cm, s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)

Output:

21.228  15.920 IMG_20160910_220903848.jpg

I would also like to know whether there is a simpler way to get the image name, the way s.height.cm gives me the height in cm. My primary requirement is to find out where the image is in the document, because I need to extract the image, do some work on it, and then put it back in the same location.

4

There are 4 answers

0
scanny On BEST ANSWER

This operation is not directly supported by the API.

However, if you're willing to dig into the internals a bit and use the underlying lxml API, it's possible.

The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).

This specimen XML might be helpful: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml

From the inline shape containing the picture, you get the <a:blip> element with this:

blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip

The relationship id (r:id generally, but r:embed in this case) is available at:

rId = blip.embed

Then you can get the image part from the document part, where document is the object returned by docx.Document():

document_part = document.part
image_part = document_part.related_parts[rId]

And then the binary image is available for read and write on ._blob.

If you write a new blob, it will replace the prior image when saved.
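
The steps above can be put together roughly as follows. This is only a sketch under some assumptions: the helper names image_part_for and replace_image_blobs are made up for illustration, document is an already-opened docx.Document, and transform is whatever bytes-to-bytes image processing you supply. It relies on the private ._inline and ._blob attributes, so it may break between python-docx versions.

```python
def image_part_for(document, inline_shape):
    """Return the image part behind an inline shape, or None if there is none."""
    pic = inline_shape._inline.graphic.graphicData.pic
    if pic is None:
        return None  # not a picture (for example a chart)
    # <a:blip r:embed="rIdN"> holds the relationship id of the image part
    rId = pic.blipFill.blip.embed
    if rId is None:
        return None  # linked picture, nothing embedded in the file
    return document.part.related_parts[rId]


def replace_image_blobs(document, transform):
    """Apply transform(bytes) -> bytes to every embedded inline image.

    Returns the number of distinct image parts rewritten. The same image
    inserted more than once can share a single part, so each part is
    transformed only once.
    """
    seen = set()
    for shape in document.inline_shapes:
        part = image_part_for(document, shape)
        if part is None or id(part) in seen:
            continue
        seen.add(id(part))
        part._blob = transform(part._blob)  # private attribute, per the answer
    return len(seen)
```

You would call it as replace_image_blobs(document, my_transform) and then document.save(new_filename). The picture keeps its place in the document because only the bytes of the image part change, not the XML that positions it.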

You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.

Some image characteristics may be cached, so not every detail may behave as expected until you save and reload the file; just be alert for that.

Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)

2
Kf H. On

You can also loop over the paragraphs and check which ones contain an image, for example by testing whether the paragraph's XML contains "graphicData". Those paragraphs are the image containers (you can do the same with runs):

from docx import Document

image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
    if 'graphicData' in par._p.xml:
        image_paragraphs.append(par)

Then unzip the .docx file: the images are in the "word/media" folder, usually in the same order as they appear in the image_paragraphs list, although that ordering is not guaranteed. Each paragraph object gives you several ways to modify it. If you want to extract an image, process it, and then insert it back in the same place, then:

paragraph.clear()
paragraph.add_run('your description, if needed')
run = paragraph.runs[0]
run.add_picture(path_to_pic, width, height)
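
If you would rather not rely on the order of the files in the archive, here is a sketch that uses only the standard library to pair each image paragraph with the file it actually references. It assumes the usual .docx layout (word/document.xml plus word/_rels/document.xml.rels); the function name image_locations is made up for illustration. It counts only top-level body paragraphs, which is also what doc.paragraphs returns, so the indices should line up.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
}


def image_locations(docx_path):
    """Return (paragraph_index, target) pairs for embedded images, in body order."""
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (rId1, rId2, ...) to their targets inside word/
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall("rel:Relationship", NS)
    }

    found = []
    body = root.find("w:body", NS)
    for index, par in enumerate(body.findall("w:p", NS)):
        # <a:blip r:embed="rIdN"> marks an embedded picture at any depth
        for blip in par.iter("{%s}blip" % NS["a"]):
            rid = blip.get("{%s}embed" % NS["r"])
            if rid in targets:
                found.append((index, targets[rid]))
    return found
```

Each target is a path relative to the word/ folder of the archive, so the image bytes can be read from the same zip under "word/" + target.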
0
Itamar Rocha Filho On

I've never really written an answer here before, but I think this might solve your problem. This short snippet collects the indices of the paragraphs that contain an image, so you can see where each image sits among all the paragraphs. Hope it helps.

import docx

doc = docx.Document(filename)

paraGr = []  # text of every paragraph
index = []   # indices of the paragraphs that contain an image

for i, par in enumerate(doc.paragraphs):
    paraGr.append(par.text)
    if 'graphicData' in par._p.xml:
        index.append(i)
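
If a paragraph index is too coarse, the same idea can be taken down to the run level. The following is a sketch only: image_positions is a made-up name, document is an already-opened docx.Document, and it relies on the private ._r element and on python-docx binding the a: and r: prefixes for xpath() calls on its elements. Runs nested inside hyperlinks are not part of paragraph.runs, so pictures there would be missed.

```python
def image_positions(document):
    """Return (paragraph_index, run_index, rId) for each picture found in a run."""
    positions = []
    for p_idx, paragraph in enumerate(document.paragraphs):
        for r_idx, run in enumerate(paragraph.runs):
            # Each <a:blip r:embed="rIdN"> inside the run is one embedded picture
            for rid in run._r.xpath(".//a:blip/@r:embed"):
                positions.append((p_idx, r_idx, str(rid)))
    return positions
```

Each rId in the result can be looked up in document.part.related_parts to reach the matching image part.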
0
dataninsight On

If you are using Python 3, install the library first:

pip install python-docx

import docx
doc = docx.Document(document_path)
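paragraph_texts = []  # text of every paragraph
image_indices = []    # indices of the paragraphs that contain an image
for i, par in enumerate(doc.paragraphs):
    paragraph_texts.append(par.text)
    if 'graphicData' in par._p.xml:
        image_indices.append(i)
print(image_indices)

# Prints the list of paragraph indices that contain an image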