python-docx get info from dropdownlist (in table)

Question

2.5k views Asked by Joost At 13 October 2024 at 19:40

I have a .docx file with multiple tables, and I want to collect all of the information from the tables in a list (called 'alletabellen'). The script below retrieves almost everything, except the values of fields that are dropdown lists in some of the table cells. Those values stay empty in my list. For example, the value '1.2' of the field 'Number:' (see https://s30.postimg.org/477j8z6ch/table.png) does not end up in my list.

Is it possible to get the information from these variables as well?

import docx

bestand = docx.Document('somefile.docx')
tabellen = bestand.tables

alletabellen = []
# walk every cell of every table and collect the paragraph text
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

Update

I found a solution (thanks to scanny, who pointed me in the right direction). I didn't realize that a .docx file is actually a zip archive that contains, among other things, an XML file with all of the text. I used the zipfile module to read that XML from the .docx and the bs4 module to find all dropdown list tags ('ddList') and put the data in a list. My document has 12 dropdown lists and I only needed 3 of them (one of them being 'Number:' from the screenshot, which is the first dropdown list in the document).

import docx
import zipfile
from bs4 import BeautifulSoup

doc = 'somefile.docx'

bestand = docx.Document(doc)
tabellen = bestand.tables

#get data from all the "normal" fields

alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

#get data from all the dropdown lists

# a .docx is a zip archive; the body text lives in word/document.xml
with zipfile.ZipFile(doc) as document:
    xml_data = document.read('word/document.xml')

soup = BeautifulSoup(xml_data, 'xml')
gegevens = soup.find_all('ddList')     #search dropdownlists (n = 12)

dropdownlist = []
dropdownlistdata = []

for i in gegevens:
    dropdownlist.append(i.find('result'))

#convert to string for if statements
number = str(dropdownlist[0])
job = str(dropdownlist[1])
vehicle = str(dropdownlist[7])

if number == '<w:result w:val="1"/>' :
    dropdownlistdata.append('0,3')
elif number == '<w:result w:val="2"/>' :
    dropdownlistdata.append('1,2')
elif number == '<w:result w:val="3"/>' :
    dropdownlistdata.append('onbekend')
else:
    dropdownlistdata.append('geen')

if job  == '<w:result w:val="1"/>' :
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

if vehicle == '<w:result w:val="1"/>' :
    dropdownlistdata.append('nee')
else:
    dropdownlistdata.append('ja')

#show data
print(alletabellen)
print(dropdownlistdata)
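
Comparing the string form of each tag works for this one document, but the selected text can also be looked up generically. The following is a minimal sketch, not part of the original solution. It assumes the dropdowns are legacy form fields, where each `w:ddList` holds its options as `w:listEntry` elements and `w:result` is the zero-based index of the selected option (falling back to `w:default`, then to 0, when `w:result` is absent, which is an assumption about how Word writes the file). The helper names `read_document_xml` and `selected_dropdown_values` are made up for illustration, and only the standard library is used.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W


def read_document_xml(path):
    """Return the raw bytes of word/document.xml from a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml")


def selected_dropdown_values(xml_data):
    """Return the selected entry text of every w:ddList, in document order."""
    root = ET.fromstring(xml_data)
    values = []
    for dd in root.iter("{%s}ddList" % W):
        entries = [e.get(VAL) for e in dd.findall("w:listEntry", NS)]
        # Assumption: w:result is a zero-based index; when it is missing,
        # use w:default if present, otherwise the first entry.
        chosen = dd.find("w:result", NS)
        if chosen is None:
            chosen = dd.find("w:default", NS)
        index = int(chosen.get(VAL, 0)) if chosen is not None else 0
        values.append(entries[index] if 0 <= index < len(entries) else None)
    return values
```

With the document from the question, `selected_dropdown_values(read_document_xml('somefile.docx'))` would return one entry per dropdown list, so the three needed values could be picked by position (0, 1 and 7) instead of being hard-coded in if statements.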

Original Q&A

1 answer

**scanny** · Accepted Answer · 2017-01-09 20:22:43

The reason the '1.2' isn't coming back from the .text call is most likely that it's wrapped in some sort of "container" XML to make it behave like a form field.

The first step would be to inspect the XML so you can see what you're up against. Then you would write some code to find the buried content.

opc-diag can help you inspect your XML: http://opc-diag.readthedocs.io/en/latest/index.html

You'll want to be looking in the document.xml part.
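
If installing a separate tool is not an option, the same part can be inspected with the standard library alone. This is only a rough sketch under that assumption; the function names `pretty_document_xml` and `lines_mentioning` are hypothetical helpers, not part of python-docx or opc-diag.

```python
import zipfile
from xml.dom import minidom


def pretty_document_xml(path):
    """Return word/document.xml from a .docx file as indented text."""
    with zipfile.ZipFile(path) as archive:
        raw = archive.read("word/document.xml")
    return minidom.parseString(raw).toprettyxml(indent="  ")


def lines_mentioning(path, needle):
    """Return the stripped lines of the indented XML that contain needle."""
    text = pretty_document_xml(path)
    return [line.strip() for line in text.splitlines() if needle in line]
```

Printing `pretty_document_xml('somefile.docx')` shows the whole part. Searching it for the visible label near the empty cell (for example `lines_mentioning('somefile.docx', 'Number:')`) is one way to find the surrounding elements that hold the missing value.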

If you trim down your document to just the minimum that exhibits this behavior, that makes it easier to locate the portion you need to work on.

If you can post the XML of that part of the table I can direct you further.
