I have a .docx file with multiple tables and I want to collect all of the information from the tables in a list (called 'alletabellen'). With the script below I get almost all of the information in the tables, except the values of some variables that are selected from a dropdown list in some of the table cells. The values of these cells remain empty in my list. For example, the value '1.2' of the variable 'Number:' (see https://s30.postimg.org/477j8z6ch/table.png) does not end up in my list.
Is it possible to get the information from these variables as well?
import docx

bestand = docx.Document('somefile.docx')
tabellen = bestand.tables

alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)
Update
I found a solution (thanks to scanny, who pointed me in the right direction). I didn't realize that a .docx file is actually a zip archive containing, among other things, an XML file with all the text. I used the zipfile module to read that XML from the docx and the bs4 module to find all dropdown list tags ('ddList') and put the data in a list. My document has 12 dropdown lists and I only needed 3 of them (one being 'Number:' from the screenshot, which is the first dropdown list in the document).
import docx
import zipfile
from bs4 import BeautifulSoup

doc = 'somefile.docx'
bestand = docx.Document(doc)
tabellen = bestand.tables

# get data from all the "normal" fields
alletabellen = []
for tabel in tabellen:
    for row in tabel.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                alletabellen.append(paragraph.text)

# get data from all the dropdown lists
with zipfile.ZipFile(doc) as document:
    xml_data = document.read('word/document.xml')

soup = BeautifulSoup(xml_data, 'xml')
gegevens = soup.find_all('ddList')  # search dropdown lists (n = 12)
dropdownlist = []
dropdownlistdata = []
for i in gegevens:
    # the 'result' tag holds the index of the selected entry
    dropdownlist.append(i.find('result'))

# convert to string for if statements
number = str(dropdownlist[0])
job = str(dropdownlist[1])
vehicle = str(dropdownlist[7])

if number == '<w:result w:val="1"/>':
    dropdownlistdata.append('0,3')
elif number == '<w:result w:val="2"/>':
    dropdownlistdata.append('1,2')
elif number == '<w:result w:val="3"/>':
    dropdownlistdata.append('onbekend')  # unknown
else:
    dropdownlistdata.append('geen')  # none

if job == '<w:result w:val="1"/>':
    dropdownlistdata.append('nee')  # no
else:
    dropdownlistdata.append('ja')  # yes

if vehicle == '<w:result w:val="1"/>':
    dropdownlistdata.append('nee')  # no
else:
    dropdownlistdata.append('ja')  # yes

# show data
print(alletabellen)
print(dropdownlistdata)
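A possible refinement, offered only as a sketch: instead of hard-coding the meaning of each index in if/elif chains, the selected text can probably be looked up in the XML itself. This assumes the dropdowns are legacy form fields in which each 'ddList' element holds 'listEntry' children with the option texts and an optional 'result' (or 'default') child with the zero-based index of the selected option. The helper name selected_dropdown_values and the SAMPLE fragment, including its entries, are made up for illustration and are not taken from the real document.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used for the w: prefix in document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def selected_dropdown_values(xml_data):
    """Return the selected entry text of every ddList, in document order.

    Hypothetical helper. Assumes w:result / w:default hold a zero-based
    index into the w:listEntry children; falls back to 0 if both are absent.
    """
    root = ET.fromstring(xml_data)
    values = []
    for dd in root.iter(W + "ddList"):
        entries = [e.get(W + "val") for e in dd.findall(W + "listEntry")]
        index = 0
        for tag in ("result", "default"):
            node = dd.find(W + tag)
            if node is not None:
                index = int(node.get(W + "val"))
                break
        # None marks an index that does not match any entry
        values.append(entries[index] if 0 <= index < len(entries) else None)
    return values


# Invented sample fragment, only to show the assumed structure
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p><w:r>'
    '<w:fldChar w:fldCharType="begin"><w:ffData><w:ddList>'
    '<w:result w:val="2"/>'
    '<w:listEntry w:val="geen"/><w:listEntry w:val="0,3"/>'
    '<w:listEntry w:val="1,2"/><w:listEntry w:val="onbekend"/>'
    '</w:ddList></w:ffData></w:fldChar>'
    '</w:r></w:p></w:body></w:document>'
)

print(selected_dropdown_values(SAMPLE))
```

With a real file, the bytes returned by document.read('word/document.xml') could be passed in place of SAMPLE. If the document's XML turns out to be structured differently, the lookup would need adjusting.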
The reason the '1.2' isn't coming back from the .text call is most likely that it's wrapped in some sort of "container" XML to make it behave like a form field.
The first step would be to inspect the XML so you can see what you're up against. Then you would write some code to find the buried content.
opc-diag can help you inspect your XML: http://opc-diag.readthedocs.io/en/latest/index.html
You'll want to be looking in the document.xml part.
If you trim down your document to just the minimum that exhibits this behavior, that makes it easier to locate the portion you need to work on.
If you can post the XML of that part of the table I can direct you further.
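If installing opc-diag is not an option, the document.xml part can also be inspected with the standard library alone. The following is a minimal sketch, not a replacement for opc-diag: it assumes the main part lives at word/document.xml inside the archive, and the function name pretty_document_xml is made up for this example.

```python
import zipfile
from xml.dom import minidom


def pretty_document_xml(docx_path, part="word/document.xml"):
    """Return the given part of a .docx archive as indented XML text.

    Hypothetical helper; the default part name is an assumption that
    holds for typical Word documents.
    """
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Usage (path is a placeholder):
# print(pretty_document_xml("somefile.docx"))
```

Searching the printed output for the label next to the missing value (here 'Number:') should lead to the surrounding XML that wraps the field.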