How to strip (cid:) markers when using Camelot to extract text from PDFs

I'm using Camelot to extract text from PDFs. The PDFs also contain Chinese characters, for which Camelot prints the CID instead, e.g. (cid:3634).

I want to strip out those CID markers, as the Chinese characters are not important to me.

I tried this:

>>> tables = camelot.read_pdf('D:/iolo/1.  Hangcha/1.  FORKLIFTS ELECTRIC/2.  NK15E - 3 WHEEL - NEW-(2014)/copy.pdf',pages='12',strip_text='(cid:[0-9])')

But this only removes the "(cid:)" frame and not the numbers inside it.

See the example output image. Please help.

There is 1 answer

Stefano Fiorucci - anakin87 On BEST ANSWER

At the moment, Camelot's strip_text parameter doesn't support regular expressions (see the official repository).
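
A rough way to see why the attempt in the question left the digits behind, assuming strip_text deletes each character of the given string individually rather than treating it as a pattern. This is an assumption that matches the output described in the question, and strip_chars is a hypothetical helper, not part of Camelot's API:

```python
import re

def strip_chars(text, chars):
    # Assumption: imitate a strip_text that removes every listed character
    # one by one, by building a single regex character class from them.
    char_class = "[" + "".join(re.escape(c) for c in chars) + "]"
    return re.sub(char_class, "", text)

# "(", "c", "i", "d", ":", ")" are removed, but 3, 6 and 4 are not in the list
print(strip_chars("(cid:3634) NK15E", "(cid:[0-9])"))
```

Under that assumption, only the literal characters 0 and 9 count as digits to remove, and ordinary letters c, i and d elsewhere in a cell would be removed as well.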

Instead, you can use the pandas replace method on each table's DataFrame:

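for table in tables:
    # Raw string avoids invalid-escape warnings; the regex removes every "(cid:<digits>)" marker in place
    table.df.replace(to_replace=r'\(cid:[0-9]+\)', value='', inplace=True, regex=True)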
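
To check the pattern before running it on real tables, here is a minimal sketch that uses only the standard re module on plain strings. The sample rows and the strip_cid helper are made up for illustration, and the extra .strip() that trims leftover whitespace is an addition that the replace call above does not perform:

```python
import re

# Same pattern as in the replace call: a literal "(cid:", one or more digits, ")"
CID_PATTERN = re.compile(r"\(cid:[0-9]+\)")

def strip_cid(text):
    """Remove every (cid:N) marker from one cell's text and trim whitespace."""
    return CID_PATTERN.sub("", text).strip()

# Hypothetical cell values standing in for the contents of table.df
rows = [["(cid:3634)(cid:120) Model", "NK15E"], ["Capacity (kg)", "1500"]]
cleaned = [[strip_cid(cell) for cell in row] for row in rows]
print(cleaned)
```

Ordinary parentheses such as "(kg)" are left alone, because the pattern only matches the full "(cid:" prefix followed by digits.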