CID encoding of font


I am trying to extract text from a PDF with Python. None of the packages I tried (PyPDF2, pdfminer, fitz, etc.) could read it, but some of them returned the CID codes (e.g. (cid:3)).

For now I read the file the "brute force" way: I worked out the CID decoding from a few examples. (That notebook can be found here on Kaggle.)
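A minimal sketch of that brute-force approach, assuming the extractor returns tokens of the form (cid:N). The mapping table below is a made-up placeholder, not the real table for this PDF, and the names `CID_TO_CHAR` and `decode_cid_text` are only illustrative:

```python
import re

# Hypothetical table, built by hand by comparing (cid:N) tokens with the
# rendered page. These numbers and characters are placeholders only.
CID_TO_CHAR = {3: " ", 36: "A", 37: "B"}

# Matches one token such as "(cid:36)" and captures the number.
_CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cid_text(raw, mapping=CID_TO_CHAR, unknown="\ufffd"):
    """Replace each (cid:N) token with mapping[N], or `unknown` if N is unmapped."""
    return _CID_TOKEN.sub(
        lambda m: mapping.get(int(m.group(1)), unknown), raw
    )
```

Unmapped codes come out as U+FFFD, so gaps in the hand-built table stay visible in the output instead of silently disappearing.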

I searched online for a more elegant way and found many mentions of Registry-Ordering-Supplement (ROS), and of how you should find the encoding by knowing the font.

Although fitz cannot interpret the text, it reports the font as CourierNewPSMT. Even with this information, I could not find the ROS info / CID encoding / CID mapping / CID collection.

Can someone tell me how to interpret the CID-encoded text, knowing the font?


There are 2 answers

1
K J On BEST ANSWER

What is needed is a PDF editor that recodes the missing characters; otherwise you may as well discard the plain text. Use a tool suited to the task, which here means visually mapping each bad character to the expected one. In a GUI editor's remap dialog this took less time than shown here. Many such editors are available, but as they are commercially licensed (I think I paid about $15) I will not promote any one.

Once the characters are remapped, it is easier to use Python extraction, such as here, to the console or to a file, or to modify the PDF in many other ways.


0
iPDFdev On

nup_encoded.pdf - the text in the PDF file is not prepared for text extraction; the font is missing the ToUnicode CMap.

The text is displayed using the actual glyph indices and not character codes. What you see as the letter 'A' is, in the PDF, 'display the glyph image at index 1', where the glyph image is a vector drawing of the letter 'A'. The font does not include the ToUnicode CMap, which provides the mapping between glyph index 1 and the letter 'A', because this structure is required only for text extraction and not for text display.
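To make the missing piece concrete, here is a rough sketch of what a ToUnicode CMap carries and how an extractor might read it. The `SAMPLE_CMAP` text is invented for illustration (the file in question has no such stream), `parse_bfchar` is a hypothetical helper, and the sketch handles only `bfchar` entries with UTF-16BE targets, ignoring `bfrange`:

```python
import re

# Invented example of the bfchar section of a ToUnicode CMap:
# each pair maps a source code (here a glyph index) to a Unicode value.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0041>
<0002> <0042>
endbfchar
"""

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {source code: Unicode string} for every bfchar pair found."""
    mapping = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _BFCHAR_PAIR.findall(block):
            # Assumes the target is an even number of hex digits (UTF-16BE).
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping
```

With the sample above, glyph index 1 resolves to 'A' and index 2 to 'B'. Without such a table in the font, an extractor has only the bare indices, which is why tools fall back to printing (cid:N).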

The 'ROS info/ CID encoding/ CID mapping / CID collection' do not help you here.