I am currently using python-docx to convert the text in a .docx file into a single string:
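import os
from docx import opendocx, getdocumenttext  # legacy python-docx (0.2.x) API

# opendocx accepts a path, so the file does not need to be opened by hand
document = opendocx(os.path.expanduser("~/documents/myFile.docx"))
docString = ''.join(getdocumenttext(document))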
I then parse the string using Python's built-in split methods. Once the string is parsed into a list, I load that list into a MySQL database. This works well; my only problem is that I want to preserve the special characters.
The database supports these special characters (UTF-8), but many characters and much of the formatting (italics, bold, etc.) are lost when I convert the .docx into a string.
I want to be able to parse and load text with the formatting intact from the .docx file.
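To make the goal concrete, here is a minimal sketch of the kind of output I am after. It does not use python-docx at all; it assumes the .docx is a zip archive containing word/document.xml and reads that part with the standard library. The names iter_runs and _is_on are hypothetical helpers made up for this illustration, and the sketch only looks at bold and italic set directly on a run, not formatting inherited from styles.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(rpr, tag):
    """Hypothetical helper: True if a toggle such as <w:b/> is present and not switched off."""
    if rpr is None:
        return False
    el = rpr.find(W + tag)
    if el is None:
        return False
    # <w:b/> means "on"; <w:b w:val="0"/> or w:val="false" means "off"
    return el.get(W + "val", "true") not in ("0", "false")


def iter_runs(docx_file):
    """Yield (text, bold, italic) for every run in the document.

    docx_file may be a path or a binary file-like object.
    Only direct run formatting is inspected (an assumption of this sketch).
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for run in root.iter(W + "r"):
        # Text is returned as Unicode, so non-ASCII characters survive
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        rpr = run.find(W + "rPr")
        yield text, _is_on(rpr, "b"), _is_on(rpr, "i")
```

Each tuple could then be stored as its own row (text plus two flag columns), or the flags could be turned into inline markup before the text goes into the UTF-8 column. Which of the two is better depends on how the text is split afterwards.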