Thoroughly confused about using .doc APIs


Let me start off by saying my Python knowledge is at a beginner-to-intermediate level, and I only recently started using the language again after a long break.

The Goal:

This morning I came across a bunch of Word documents that I wanted to convert to PDF and concatenate, with every 2 .doc files producing one PDF. It seemed like a fairly trivial task, so I figured I'd try to learn how to do it in Python. Concatenating PDFs wasn't too bad: I found PyPDF2 and managed to write a script that did just that.

But 7 hours later, after countless scripts with broken dependencies, I still can't find a way to automate the .doc-to-PDF conversion.

The Problem(s):

Every script I found either:

  1. uses python-docx, which doesn't help because my documents are Word 2003 .doc files, not .docx.
  2. uses the unoconv bridge, which I installed along with OpenOffice. I searched around for documentation but found none, so I have no idea how to call it from a Python script or the shell. I saw one example for this, but it keeps throwing errors.
  3. uses win32com or win32com.client or pywin32 or somesuch. I ran into numerous issues with these: I installed one but couldn't import it from code (as happened to the guy here), and now I can't even find them with pip. I searched for documentation for them (are they modules or classes? I have no idea) and found practically nothing that I could understand, beyond that they're connected to ActivePython (which is apparently a superset of Python with more capabilities?).
  4. uses comtypes, which I installed but was also unable to import or use for some reason (maybe I'm using pip wrong somehow?).

I know my question is hardly focused, but honestly by now my brain is fried from information overload. Any simplifications for a noob would be more than welcome.

TL;DR:

Assuming no knowledge of COM stuff and little experience with any external frameworks:

  1. What would I have to do to convert Word 2003 .doc files to .pdf files? I'm running Python 3.5.1 32-bit on a Windows 10 64-bit machine.
  2. Where can I learn more about accessing other software's APIs from Python? Are there big prerequisites for this stuff, like knowing how the OS works on a lower level?

Thanks!


There is 1 answer

Answered by Gribouillis

In my experience, converting between the various office formats is best done outside of Python. With the subprocess module, you can call the external command

soffice --convert-to pdf file.doc --headless

where soffice is the command that comes with LibreOffice.
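Calling that command from Python could look roughly like the sketch below. It is a minimal example, not a definitive implementation: the helper names build_convert_command and convert_to_pdf are made up for illustration, and it assumes that LibreOffice is installed and that soffice can be found on your PATH. The --outdir option is added so the PDF lands in a folder you choose.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the argument list for a headless LibreOffice conversion to PDF.

    `soffice` is assumed to be on PATH; pass the full path to the
    executable if it is not.
    """
    return [
        soffice,
        "--headless",            # run without opening the GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_pdf(doc_path, out_dir=".", soffice="soffice"):
    """Convert one document and return the path where the PDF is expected."""
    cmd = build_convert_command(doc_path, out_dir, soffice)
    # check=True raises CalledProcessError if soffice exits with an error code
    subprocess.run(cmd, check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")
```

To convert every .doc file in the current folder, you could loop with something like `for doc in Path(".").glob("*.doc"): convert_to_pdf(doc, "pdf_out")`. On Windows, soffice is often not on PATH after installation, so you may need to pass the full path to soffice.exe (inside the program folder of your LibreOffice installation) as the soffice argument. Passing the command as a list rather than one string avoids quoting problems with file names that contain spaces.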