python-docx: get info from a dropdown list (in a table)


I have a docx file with multiple tables and I want to collect all of the information from the tables in a list (called 'alletabellen'). The script below retrieves almost everything in the tables, except the values of some variables that are set through a dropdown list in some of the table cells. Those cells remain empty in my list. For example, the value '1.2' of the variable 'Number:' (see https://s30.postimg.org/477j8z6ch/table.png) does not end up in my list.

Is it possible to get the information from these variables as well?

import docx

bestand = docx.Document('somefile.docx')
tabellen = bestand.tables

alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

Update

I found a solution (thanks to scanny, who pointed me in the right direction). I didn't realize that a docx file is actually a zip archive containing, among other things, an XML file with all the text. I used the zipfile module to read that XML from the docx and the bs4 module to find all dropdown list tags ('ddList') and put their data in a list. My document contains 12 dropdown lists and I only needed 3 of them (one of them being 'Number:' from the screenshot, which is the first dropdown list in the document).

import docx
import zipfile
from bs4 import BeautifulSoup

doc = 'somefile.docx'

bestand = docx.Document(doc)
tabellen = bestand.tables

#get data from all the "normal" fields

alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

#get data from all the dropdown lists

with zipfile.ZipFile(doc) as document:
    xml_data = document.read('word/document.xml')

soup = BeautifulSoup(xml_data, 'xml')
gegevens = soup.find_all('ddList')     # find the dropdown lists (n = 12)

dropdownlist = []
dropdownlistdata = []

for i in gegevens:
    dropdownlist.append(i.find('result'))

#convert to string for if statements
number = str(dropdownlist[0])
job = str(dropdownlist[1])
vehicle = str(dropdownlist[7])

if number == '<w:result w:val="1"/>' :
    dropdownlistdata.append('0,3')
elif number == '<w:result w:val="2"/>' :
    dropdownlistdata.append('1,2')
elif number == '<w:result w:val="3"/>' :
    dropdownlistdata.append('onbekend')
else:
    dropdownlistdata.append('geen')

if job  == '<w:result w:val="1"/>' :
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

if vehicle == '<w:result w:val="1"/>' :
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

#show data
print(alletabellen)
print(dropdownlistdata)
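
The hard-coded if/elif mapping above could probably be avoided, because the entry texts appear to be stored in the XML next to the selection. The following is only a sketch using the standard library, under these assumptions: the dropdowns are legacy form fields (w:ddList), each w:listEntry child holds one entry text in its w:val attribute, and w:result (or w:default when w:result is absent) holds the zero-based index of the selected entry. That reading is consistent with the mapping above, but it has not been checked against the original document. The helper name read_dropdown_values is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used for every w: element and attribute.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W


def read_dropdown_values(docx_path):
    """Return the selected entry text of every w:ddList, in document order.

    Hypothetical helper, not part of python-docx.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    values = []
    for ddlist in root.iter("{%s}ddList" % W):
        entries = [e.get(VAL) for e in ddlist.findall("w:listEntry", NS)]
        # Assumption: w:result is the zero-based index of the selection;
        # if it is missing, use w:default, and otherwise the first entry.
        index = 0
        for tag in ("w:result", "w:default"):
            node = ddlist.find(tag, NS)
            if node is not None:
                index = int(node.get(VAL))
                break
        values.append(entries[index] if index < len(entries) else None)
    return values
```

Calling read_dropdown_values('somefile.docx') would then give one string per dropdown list, so the three values needed here would be at positions 0, 1 and 7 of the returned list. Dropdowns implemented as content controls (w:sdt) are stored differently and are not covered by this sketch.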

There is 1 answer

scanny (best answer)

The reason the '1.2' isn't coming back from the .text call is most likely that it's wrapped in some sort of "container" XML to make it behave like a form field.

The first step would be to inspect the XML so you can see what you're up against. Then you would write some code to find the buried content.

opc-diag can help you inspect your XML: http://opc-diag.readthedocs.io/en/latest/index.html

You'll want to be looking in the document.xml part.
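
If installing opc-diag is not convenient, the same part can be viewed with the standard library alone, since a .docx file is a zip archive. This is a rough sketch rather than a replacement for opc-diag; the helper name dump_part is made up for this example.

```python
import zipfile
import xml.dom.minidom


def dump_part(docx_path, part="word/document.xml"):
    """Return one part of a .docx package as indented XML text.

    Hypothetical helper for inspection only.
    """
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    # minidom re-indents the XML so nested elements are easier to read.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

Printing dump_part('somefile.docx') and searching the output for the label next to the missing value (here 'Number:') should show which elements wrap the dropdown.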

If you trim down your document to just the minimum that exhibits this behavior, that makes it easier to locate the portion you need to work on.

If you can post the XML of that part of the table I can direct you further.