How to use borb and a Translate API to translate a PDF while maintaining formatting?

I found borb - a cool Python package to analyze and create PDFs.

And there are several translation APIs available, e.g. Google Translate and DeepL.

(I realize the translated text will likely differ in length from the original, but to first order I'm willing to ignore this for now.)

However, it is not clear to me from the borb documentation how to replace all text with its translation while maintaining all formatting.

There are 2 answers

Answer by Joris Schellekens

Disclaimer: I am Joris Schellekens, the author of borb.

I don't think it will be easy to replace the text in the PDF. In-place text replacement is generally not something the PDF format really allows for.

The problem you are facing is called "reflowing the content": changing the text may make a line longer or shorter, and then the whole paragraph changes. If that paragraph is part of a table, the whole table needs to change, and so on.
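To make the reflow problem concrete, here is a minimal sketch (not borb code) that uses Python's standard textwrap module to count how many lines a text needs at a fixed column width. The function names, the sample strings, and the width are assumptions chosen for illustration; real PDF layout depends on font metrics rather than character counts.

```python
import textwrap


def lines_needed(text: str, width: int) -> int:
    """Number of lines the text occupies when wrapped to `width` characters."""
    return len(textwrap.wrap(text, width=width))


def reflow_delta(original: str, translated: str, width: int) -> int:
    """Positive if the translation needs more lines than the original."""
    return lines_needed(translated, width) - lines_needed(original, width)


if __name__ == "__main__":
    # Hypothetical sample strings; any source/translation pair works here.
    original = "The quick brown fox jumps over the lazy dog."
    translated = "Der schnelle braune Fuchs springt über den faulen Hund."
    print(reflow_delta(original, translated, width=20))
```

A non-zero result means the translated paragraph no longer fits the space the original occupied, so everything below it on the page would have to move.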

There are a couple of quick hacks.

You could write new content on top of the PDF, in a separate layer. The PDF spec calls these "optional content groups".

There is code in borb that does this already (the code related to OCR).

Answer by Thomas

Unfortunately, there is no easy, free, or foolproof way to translate PDF documents while maintaining their formatting.

DeepL's new Python library allows for full document translation in this manner:

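import deepl

# Replace with your own DeepL API authentication key
auth_key = "YOUR_AUTH_KEY"
translator = deepl.Translator(auth_key)

# Upload the source PDF, translate it, and write the result to the output path
translator.translate_document_from_filepath(
    "path/to/original/file.pdf",
    "path/to/write/translation/to.pdf",
    target_lang="EN-US"
)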

The company now offers a free API with a character limit. If you have a few short PDFs you'd like to translate, this will probably be the way to go.

If you have many longer PDFs and don't mind paying a base of $5.49/month plus $25.00 per 1 million characters translated, the DeepL API is still probably the way to go.

EDIT: I have since tried the DeepL full document translation feature on Mandarin text, and this method is far from foolproof or accurate. At least among the Mandarin documents I examined, the formatting varied significantly from one document to the next, and DeepL was unable to translate full documents accurately across that range of formatting. If you only need a rough translation of a document, I would still recommend using DeepL's document translator. However, if you need a high degree of accuracy, there won't be an 'easy' way to do this (read the rest of the answer). Again, I have only tried this feature with Mandarin PDF files.


However, if you'd like to focus on text extraction, translation, and formatting without using DeepL's full document translation feature, and are able to sink some real time into building software that can do this, I would recommend using pdfplumber. While it has a steep learning curve, it is an incredibly powerful tool that provides data on each character in the PDF and on image areas, and offers visual debugging and table extraction tools. It is important to note that it only works on machine-generated PDFs and has no OCR feature.

Many of the PDFs I deal with are in Mandarin and have characters that are listed out of order. Using the data that pdfplumber provides on each character, however, it is possible to determine their position on the page. For instance, if two characters have the same "Distance of top of character from bottom of page" (see the char properties section of the docs), it can reasonably be assumed that they are on the same line, and if character n's "Distance of left side of character from left side of page" is less than that of character n+1, then character n comes first on that line.

Working out which layout patterns are typical for the PDFs you usually work with is a long process, but text extraction that maintains line fidelity in this manner can be done with a high degree of accuracy. After extraction, passing the strings to DeepL and writing them to an output file is an easy task.
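The position-based grouping described above could be sketched roughly as follows. This is a minimal illustration, not a complete extractor: the function name and the tolerance value are assumptions, and it uses the "top" property (distance from the top of the page) rather than the distance from the bottom, purely so that lines come out in reading order. It expects character dicts with "text", "x0" and "top" keys, which is the shape pdfplumber exposes through page.chars.

```python
def group_chars_into_lines(chars, tolerance=2.0):
    """Group pdfplumber-style char dicts into text lines, top to bottom.

    `tolerance` (an assumed value, in PDF points) is how far apart two
    "top" values may be while still counting as the same line.
    """
    lines = []  # each entry is [reference_top, [chars on that line]]
    for ch in sorted(chars, key=lambda c: c["top"]):
        if lines and abs(ch["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch["top"], [ch]])

    # Within each line, order characters left to right by their x0 value.
    return [
        "".join(c["text"] for c in sorted(line_chars, key=lambda c: c["x0"]))
        for _, line_chars in lines
    ]


if __name__ == "__main__":
    # Hand-made sample data; the characters are deliberately out of order.
    sample = [
        {"text": "b", "x0": 20, "top": 100.0},
        {"text": "a", "x0": 10, "top": 100.5},
        {"text": "c", "x0": 10, "top": 120.0},
    ]
    print(group_chars_into_lines(sample))
```

With a real document, the input would be the page.chars list of a page opened with pdfplumber.open(...). The right tolerance depends on the fonts and line spacing of the documents in question, and vertical or multi-column text would need extra handling.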
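As a rough sketch of that last step, the helper below takes a list of already extracted lines, sends the non-empty ones to a translator object, and writes the results to a text file. The function name, the output path, and the choice to skip blank lines are assumptions for illustration. The translator is expected to behave like deepl.Translator, whose translate_text method accepts a list of strings and returns results with a .text attribute.

```python
def translate_lines_to_file(lines, translator, out_path, target_lang="EN-US"):
    """Translate non-empty lines and write one translated line per row."""
    # Assumption: blank lines carry no content, so they are not sent to the API.
    to_translate = [line for line in lines if line.strip()]

    if to_translate:
        # One batched call; each result object exposes the translation as .text
        results = translator.translate_text(to_translate, target_lang=target_lang)
        translated = [result.text for result in results]
    else:
        translated = []

    with open(out_path, "w", encoding="utf-8") as out_file:
        for text in translated:
            out_file.write(text + "\n")

    return translated
```

With the real library this would be called along the lines of translate_lines_to_file(lines, deepl.Translator(auth_key), "translated.txt"). Note that this produces plain translated text only; it does not put the text back into a PDF.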

If you can provide one of the PDFs you work with for testing, that would be helpful!