I am using pdf2image's convert_from_bytes on my own PDFs to convert them to PNG. The code runs in AWS Lambda on Python 3.8.
...
images = convert_from_bytes(infile,
                            dpi=DPI,
                            fmt=FMT)
for page_num, image in enumerate(images):
    location = "png/" + event.key.split('.')[0] + "-page" + str(page_num) + '.' + FMT
    buffer = BytesIO()
    image.save(buffer, FMT.upper())
    buffer.seek(0)
...
Although I am able to generate a PNG "correctly" (meaning all the information and text are present), the resulting PNG appears to use Times New Roman for every paragraph in the PDF. The PDF itself displays correctly with the right fonts, and I checked in the document properties that the fonts are embedded. The problem happens only when I convert it to PNG. I am also not using any fancy fonts, only Courier-Bold and Helvetica.
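To double-check the embedding independently of a viewer's properties dialog, one option is poppler's `pdffonts` tool, which ships with the same poppler-utils that pdf2image drives. Below is a minimal sketch that parses its output. The helper names `parse_pdffonts` and `list_fonts` are hypothetical, and the parsing assumes the usual column layout (`name type encoding emb sub uni object ID`), which may differ between poppler versions.

```python
import subprocess


def parse_pdffonts(output):
    """Parse the table printed by `pdffonts` into a list of dicts.

    Assumption: two header lines, then one font per line, with the last
    five columns being emb, sub, uni, object number and generation.
    """
    fonts = []
    for line in output.splitlines()[2:]:  # skip header and dashed rule
        # The type column can contain spaces ("Type 1"), so split from
        # the right, where the columns never contain whitespace.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # blank or unexpected line
        rest, emb, sub, _uni, _obj, _gen = parts
        fonts.append({
            "name": rest.split()[0],
            "embedded": emb == "yes",
            "subset": sub == "yes",
        })
    return fonts


def list_fonts(pdf_path):
    """Run `pdffonts` on a file (the binary must be on PATH) and parse the result."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdffonts(result.stdout)
```

Calling `list_fonts("my.pdf")` (hypothetical file name) and looking for entries with `"embedded": False` would show which fonts the renderer has to substitute from the system.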
Here is an example of a PDF (part of it):
What have I tried so far?
- I tried to convert my PDFs using some online tools to see if this works or if the PDF itself was an issue. The PNG image was correct with the right fonts.
- I tried to process some random PDFs with my Lambda function and the generated PNG had correct fonts as well so the conversion seems to work on most PDFs.
- I tried a few different fonts and got the same result.
- I tried to embed the font in AWS Lambda, loosely following "Include custom fonts in AWS Lambda", but no luck.
But at this point I am clueless. Any idea how I can debug this?
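For context, the font setup from the last bullet usually boils down to something like the sketch below. This is a minimal illustration under stated assumptions, not the exact code from my function. It assumes the fonts are shipped in a directory such as `/var/task/fonts` (hypothetical path), that poppler resolves non-embedded fonts through fontconfig, and that the `FONTCONFIG_PATH` environment variable is set before pdf2image launches the poppler subprocess. The helper name `write_fontconfig` is made up for illustration.

```python
import os
from xml.sax.saxutils import escape


def write_fontconfig(font_dir, conf_dir, cache_dir="/tmp/fonts-cache"):
    """Write a minimal fonts.conf into conf_dir and point fontconfig at it.

    Assumption: conf_dir is writable (on Lambda, typically under /tmp).
    Returns the path of the generated fonts.conf.
    """
    os.makedirs(conf_dir, exist_ok=True)
    conf_path = os.path.join(conf_dir, "fonts.conf")
    with open(conf_path, "w", encoding="utf-8") as fh:
        fh.write(
            '<?xml version="1.0"?>\n'
            '<!DOCTYPE fontconfig SYSTEM "fonts.dtd">\n'
            "<fontconfig>\n"
            f"  <dir>{escape(font_dir)}</dir>\n"
            f"  <cachedir>{escape(cache_dir)}</cachedir>\n"
            "</fontconfig>\n"
        )
    # Child processes (pdftoppm / pdftocairo) inherit this variable.
    os.environ["FONTCONFIG_PATH"] = conf_dir
    return conf_path
```

The idea is to call `write_fontconfig("/var/task/fonts", "/tmp/fontconfig")` once at the top of the handler, before `convert_from_bytes` runs.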
EDIT2: I wrote a small Python program that generates one sentence per existing base font.