Document Conversion Realtime - Implementation Questions

For an intranet portal, we need to convert MS Office documents to PDF in real time when someone provides a link to a document, after first checking whether the user is authorized to view it. We also need to cache the converted documents based on the source document's last-modified date: if another user requests the same document and its content has not changed since the last conversion, we should not convert it again.

I have some basic questions about how to implement this, and would like to hear from anyone with previous experience or thoughts on how it could be done.

For example, suppose we choose J2EE as the technology and one of the open source Java libraries for PDF conversion. I then have the following questions.

  1. If there is a 100 MB document, we would need to download the entire document from the system where it is hosted before we can start converting it. That could seriously hurt response time, given that viewing needs to be real time. Is there a way to read the first page of a document without downloading all of it, so that we can convert the document page by page?
  2. How can we cache a document? I do not think we can store it on the server or in a database, because anyone with access to that server or database could then read the document content. Any thoughts?
  3. Or would you suggest an out-of-the-box product for this instead of custom development?

Thanks

There is 1 answer

Jeroen Ritmeijer (best answer)

I work for a company that creates a product that does exactly what you are trying to do using Java / .NET Web service calls, so let me see if I can answer your questions without bias.

  1. The whole document will need to be downloaded, because it has to be interpreted in full before PDF conversion can take place (e.g. to work out page numbering). I am sure you are just giving an example, but 100 MB is very large for an MS Office document, although we do see it from time to time.

  2. You can implement caching based on your exact security requirements. If you don't want to store the converted files in a (secured) DB or file system, then perhaps you want to store them on a different server behind a firewall. Depending on the number and size of the documents you anticipate, you may want to cache them in memory. I am sure there are many J2EE caching libraries available; I know there are plenty in .NET. Just keep the most frequently requested documents in your cache.

  3. Depending on your budget you may go for an out-of-the-box product (hint hint :-). I know there are free libraries available for Java that leverage OpenOffice, but you get the same formatting limitations as when opening MS Office files in OpenOffice. Be careful when trying to do your own MS Office integration / automation. It is possible to make it reliable and scalable (we did), but it takes a long time and a lot of work.
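
To make the in-memory caching idea from point 2 a bit more concrete, here is a minimal Java sketch using only the standard library. It is an illustration under assumptions, not how any particular product works: the class name ConvertedPdfCache, the get method, and the converter callback are all hypothetical, and the eviction policy is least-recently-used (what LinkedHashMap supports directly) rather than a true "most frequently requested" policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical bounded in-memory cache of converted PDFs, keyed by document ID. */
class ConvertedPdfCache {

    /** A converted PDF plus the source's last-modified time when it was converted. */
    private static final class CachedPdf {
        final long sourceLastModified;
        final byte[] pdf;

        CachedPdf(long sourceLastModified, byte[] pdf) {
            this.sourceLastModified = sourceLastModified;
            this.pdf = pdf;
        }
    }

    private final Map<String, CachedPdf> entries;

    ConvertedPdfCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cache grows past maxEntries.
        this.entries = new LinkedHashMap<String, CachedPdf>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedPdf> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached PDF if the source has not changed since it was
     * converted; otherwise runs the converter and caches the new result.
     * Holding the lock during conversion keeps the sketch simple but
     * serializes conversions.
     */
    synchronized byte[] get(String documentId, long sourceLastModified,
                            Supplier<byte[]> converter) {
        CachedPdf cached = entries.get(documentId);
        if (cached != null && cached.sourceLastModified == sourceLastModified) {
            return cached.pdf;
        }
        byte[] pdf = converter.get();
        entries.put(documentId, new CachedPdf(sourceLastModified, pdf));
        return pdf;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

The converter callback stands in for whatever download-and-convert step you end up with. Because the cache is shared between users, the authorization check still has to run on every request before the lookup, not only when a document is first converted.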

I hope this helps.