Reading docx files, recognizing and storing italicized text


How should I go about reading a .docx file with Python, recognizing the italicized text, and storing it as a string?

I looked at the docx Python package, but all I see are features for writing to a .docx file.

I appreciate the help in advance

2

There are 2 answers

1
ChrisGuest On BEST ANSWER

Here's what my example document, TestDocument.docx, looks like.


Note: The word "Italics" is formatted directly as italic, but "Emphasis" uses the character style Emphasis.

If you install the python-docx module, this is a fairly simple exercise.

>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
...     for run in p.runs:
...         print(run.text, run.italic, run.bold)
... 
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]

The Run.italic attribute captures whether the text is directly formatted as italic, but it does not know whether a run has a style that is rendered in italics; in that case it is None, meaning the setting is inherited. Style-based italics can be detected by checking Run.style.name, provided you know which styles in your document are rendered in italics.

>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]
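
The two checks above can be combined into one helper. This is only a minimal sketch: the function name italic_texts and the italic_styles parameter are made up for illustration, and it assumes you already know which character styles in your document render as italic (here just Emphasis). An explicit run.italic of False is treated as overriding the style.

```python
def italic_texts(document, italic_styles=("Emphasis",)):
    """Return, per paragraph, the text of the runs that appear italic.

    `document` is expected to look like a python-docx Document:
    it has .paragraphs, each paragraph has .runs, and each run has
    .text, .italic (True / False / None) and .style.name.
    `italic_styles` is an assumption supplied by the caller.
    """
    result = []
    for paragraph in document.paragraphs:
        texts = []
        for run in paragraph.runs:
            style = run.style
            style_name = style.name if style is not None else None
            if run.italic is None:
                # No direct formatting: fall back to the style name.
                is_italic = style_name in italic_styles
            else:
                # Direct formatting (True or False) wins over the style.
                is_italic = run.italic
            if is_italic:
                texts.append(run.text)
        result.append(texts)
    return result


# Usage (requires python-docx):
#   from docx import Document
#   italic_texts(Document('TestDocument.docx'))
```

For the example document this should pick up both the directly formatted run and the two Emphasis runs.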
0
Miles Shipman On

Your best bet is to unzip the .docx, which produces, among other things, a directory called word. Within that directory is document.xml. From there you would need to learn the XML structure and keywords to be able to read just the italicized text. Once you have done that, all you have to do is pull the text string from the XML file.
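
A rough sketch of that approach using only the standard library might look like the following. The function name italic_runs_from_docx is made up for illustration. It assumes italics are set directly on the run through a w:i element inside w:rPr, so italics that come from a style (such as Emphasis) would not be found this way.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italic_runs_from_docx(path):
    """Return the text of every run whose own properties switch italics on."""
    # A .docx file is a zip archive, so there is no need to unzip it to disk.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    texts = []
    for run in root.iter(f"{{{W}}}r"):
        italic = run.find("w:rPr/w:i", NS)
        if italic is None:
            continue  # no direct italic setting on this run
        # <w:i/> means on; <w:i w:val="0"/> (or false/off) means off.
        if italic.get(f"{{{W}}}val", "true") in ("0", "false", "off"):
            continue
        texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


# Usage:
#   italic_runs_from_docx('TestDocument.docx')
```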