pdf2image on AWS Lambda - resulting PNG has wrong fonts


I am using pdf2image's convert_from_bytes on my own PDFs in order to get them in PNG format. The context is AWS Lambda, Python 3.8.

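from io import BytesIO

from pdf2image import convert_from_bytes

...
# Render every page of the PDF bytes to a PIL image
images = convert_from_bytes(infile,
                            dpi=DPI,
                            fmt=FMT)

for page_num, image in enumerate(images):
    # Output key: png/<source name>-page<N>.<FMT>
    location = "png/" + event.key.split('.')[0] + "-page" + str(page_num) + '.' + FMT

    # Serialize the page into an in-memory buffer
    buffer = BytesIO()
    image.save(buffer, FMT.upper())
    buffer.seek(0)
    ...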

Although I am able to generate a PNG "correctly" (meaning with all the info and text), the resulting PNG seems to use Times New Roman as the font for every single paragraph in the PDF. Meanwhile, the PDF itself displays with the right fonts, and I checked the document properties to make sure the fonts are embedded. The problem happens only when I try to convert it to PNG format. I am also not trying to use any fancy fonts, only Courier-Bold and Helvetica.

Here is an example of a PDF (part of it): [image: good fonts]

And the resulting image: [image]

What have I tried so far?

  • I converted my PDFs with some online tools to check whether the PDF itself was the issue. The PNG image was correct, with the right fonts.
  • I processed some random PDFs with my Lambda function and the generated PNGs had correct fonts as well, so the conversion seems to work on most PDFs.
  • I tried a few different fonts and got the same result.
  • I tried to embed the fonts in AWS Lambda, loosely following "Include custom fonts in AWS Lambda", but no luck.
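
For context on the last bullet, here is a minimal sketch of what such a font setup can look like. It assumes that poppler resolves non-embedded fonts through fontconfig on Linux, that fontconfig honors the FONTCONFIG_FILE environment variable, and that the font files are shipped in a directory such as /opt/fonts (a hypothetical path). The helper name write_fontconfig is made up for illustration and is not part of pdf2image.

```python
import os
from pathlib import Path

# Minimal fonts.conf: one font directory plus a writable cache directory
FONTS_CONF_TEMPLATE = """<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <dir>{font_dir}</dir>
  <cachedir>{cache_dir}</cachedir>
</fontconfig>
"""


def write_fontconfig(font_dir, work_dir="/tmp/fontconfig"):
    """Write a minimal fonts.conf and point fontconfig at it (hypothetical helper)."""
    work = Path(work_dir)
    cache = work / "cache"
    # /tmp is used by default because it is the writable location in Lambda
    cache.mkdir(parents=True, exist_ok=True)
    conf_path = work / "fonts.conf"
    conf_path.write_text(
        FONTS_CONF_TEMPLATE.format(font_dir=font_dir, cache_dir=cache)
    )
    # Set this before calling convert_from_bytes so that child processes
    # started afterwards inherit the variable
    os.environ["FONTCONFIG_FILE"] = str(conf_path)
    return conf_path
```

Calling write_fontconfig("/opt/fonts") at the top of the handler, before the conversion, would be the intended usage. Whether this changes the rendering depends on the font files in that directory actually matching the names the PDF asks for.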

But at this point I am clueless. Any idea how I can debug this?
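
One check that might help narrow this down is listing the fonts poppler sees in the PDF and whether each one is really embedded. The sketch below assumes the pdffonts tool from poppler-utils is on PATH (in Lambda it would have to be part of the same layer as pdftoppm) and that its output starts with the usual two header lines. The function names parse_pdffonts and list_fonts are hypothetical.

```python
import subprocess


def parse_pdffonts(output):
    """Parse pdffonts text output into a list of dicts with embedding flags."""
    fonts = []
    # Assumption: line 1 is the column header, line 2 the dashed separator
    for line in output.splitlines()[2:]:
        tokens = line.split()
        # Expect at least: name, type, encoding, emb, sub, uni, object, generation
        if len(tokens) < 8:
            continue
        # The type column may contain spaces ("Type 1"), so read the flags
        # from the right: object ID (2 tokens), uni, sub, emb
        fonts.append({
            "name": tokens[0],
            "embedded": tokens[-5] == "yes",
            "subset": tokens[-4] == "yes",
        })
    return fonts


def list_fonts(pdf_path):
    """Run pdffonts on a PDF file and return the parsed rows."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdffonts(result.stdout)
```

Any row with "embedded" set to False would be a font the renderer has to substitute from whatever is installed in the Lambda environment, which could explain a fallback font showing up everywhere.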

EDIT: PDF font properties: [image: font props]

EDIT2: I wrote a small Python program to generate one sentence per existing base font: [image: before]

Then, when I pass it through the Lambda, I get this: [image]
