An efficient way to convert document to pdf format

19.4k views Asked by At

I have been trying to find the efficient way to convert document e.g. doc, docx, ppt, pptx to pdf. So far i have tried docsplit and oowriter, but both took > 10 seconds to complete the job on pptx file having size 1.7MB. Can any one suggest me a better way or suggestions to improve my approach?

What i have tried:

from subprocess import Popen, PIPE
import time

def convert(src, dst):
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]

    for i in range(len(commands)):
        command = commands[i]
        st = time.time()
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True` 
        out, err = process.communicate()
        errcode = process.returncode
        if errcode != 0:
            raise Exception(err)
        en = time.time() - st
        print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))

if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)

Output:

Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds

Environment:

  • Linux - Ubuntu 12.04
  • Python 2.7.3

More tools result:

4

There are 4 answers

3
avenet On BEST ANSWER

Try calling unoconv from your Python code, it took 8 seconds on my local machine, I don't know if it's fast enough for you:

time unoconv 15.\ Text-Files.pptx
real    0m8.604s
5
Vasudev Ram On

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:

Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf

xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf

Many blog posts showing how to use xtopdf for various purpose, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf

HTH, Vasudev Ram

2
jeffknupp On

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).

0
JasonPlutext On

For doc and docx (but not ppt/pptx), you could try our independent (but commercial) high fidelity rendering engine online at OnlineDemo/docx_to_pdf

By "high fidelity", I mean it is designed from the ground up to have the same line and paragraph breaks, tab stops etc etc as Microsoft Word.