My first time posting - please go easy on me. I could not come up with a succinct title that summarizes this issue. I seem to have a codec problem.
My Django-based website calls a subprocess (soffice) to convert uploaded documents to plain text files, and then processes the text from the document. This worked beautifully for a time. On my local dev machine, the unit tests for file conversion still work perfectly, as does the complete Django app, end to end. On the production server, where it all used to work, the file conversion no longer behaves the same from within the Django app, although it still works properly when run from the test code. This change in behavior appears to be the result of running general server updates.
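import subprocess

# Convert the uploaded document to plain text with soffice.
args = ['soffice',
        '--headless',
        '--convert-to',
        'txt:Text',
        '--outdir',
        outDir,
        filePath]
subprocess.call(args)

# Read the converted text back in; no encoding is given, so the process default is used.
try:
    with open(textFilePath, "r") as fo:
        docText = fo.read()
except (OSError, UnicodeError):
    print("Failed to read", textFilePath)
    docText = None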
I removed some of the error checking to simplify a bit.
When I run the file conversion code as part of the complete Django application on the production server, I can see that certain special characters, such as the symbol §, are turned into garbage. But if I run the same file conversion code on its own, outside of Django, on the same machine, those symbols are not corrupted. As mentioned, on my dev machine it works both standalone and within Django. The one difference between the two machines is how I run Django: locally it runs with Django's runserver command, while on the production machine it runs under mod_wsgi with Apache. I don't see how Django or mod_wsgi could interfere with what soffice is doing in the subprocess, but it does appear that way. I have opened a Python shell on the problem server and run essentially the same code as above, getting clean text back, and running the unit tests against it works too.
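One way to narrow this down is to compare the text-encoding settings of the two environments. Below is a minimal sketch, on the assumption that the difference lies in the process's default encoding rather than in soffice itself. The function names are hypothetical, and UTF-8 as the encoding of the converted file is an assumption, not something I have confirmed.

```python
import locale
import sys
import tempfile
from pathlib import Path


def describe_text_environment():
    """Collect the encoding-related settings of the current process."""
    return {
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def read_converted_text(text_file_path, encoding="utf-8"):
    """Read the converted file with an explicit encoding (assumed UTF-8)
    instead of relying on the process default."""
    with open(text_file_path, "r", encoding=encoding) as fo:
        return fo.read()


def demo():
    """Show how the same bytes decode cleanly or as garbage, depending on the codec."""
    sample = "See \u00a7 4.2"  # contains the section sign
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_bytes(sample.encode("utf-8"))
        decoded_ok = read_converted_text(path)
        # Decoding the same UTF-8 bytes with a single-byte codec corrupts the symbol.
        decoded_wrong = path.read_bytes().decode("latin-1")
    return decoded_ok, decoded_wrong


if __name__ == "__main__":
    print(describe_text_environment())
    print(demo())
```

Logging the result of describe_text_environment() once from a Django view under mod_wsgi and once from a plain Python shell on the same server should show whether the two processes disagree about the default encoding.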
Any help is sincerely appreciated!
The solution was to upgrade mod_wsgi using: