How to strip (CID:) when using camelot to extract text from pdfs

Question

1.1k views Asked by Jagrut Panchal At 13 October 2020 at 13:41

I'm using Camelot to extract text from PDFs. The PDFs also contain Chinese characters, for which Camelot prints the CID instead of the character, e.g. (cid:3634).

I want to strip out those CIDs, as the Chinese characters are not important to me.

I tried this:

>>> tables = camelot.read_pdf('D:/iolo/1.  Hangcha/1.  FORKLIFTS ELECTRIC/2.  NK15E - 3 WHEEL - NEW-(2014)/copy.pdf',pages='12',strip_text='(cid:[0-9])')

But this only removes the (cid:) frame and not the numbers inside it.

See the example output image. Please help.

There is 1 answer

**Stefano Fiorucci - anakin87** · Accepted Answer · 2020-10-15T12:58:50+00:00

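At the moment, Camelot's strip_text parameter doesn't support regular expressions (see the official repository). As far as I can tell, it treats the string as a set of individual characters to remove, which is why the frame characters disappear while most of the digits stay.

Instead, you can use the pandas replace method on each extracted table: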

for table in tables:
    # Raw string so the backslashes reach the regex engine unchanged
    table.df.replace(to_replace=r'\(cid:[0-9]+\)', value='', inplace=True, regex=True)
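
To try the pattern without a PDF at hand, here is a minimal sketch that runs the same replacement on a hand-made DataFrame. The `strip_cid` helper, the `CID_PATTERN` name, and the sample values are made up for illustration and are not part of Camelot; the only assumption is that `table.df` is an ordinary pandas DataFrame of strings.

```python
import pandas as pd

# Matches placeholders such as "(cid:3634)"; raw string keeps the backslashes intact.
CID_PATTERN = r"\(cid:[0-9]+\)"


def strip_cid(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every (cid:N) placeholder removed (hypothetical helper)."""
    return df.replace(to_replace=CID_PATTERN, value="", regex=True)


# Hypothetical stand-in for a Camelot table.df; the values are invented.
sample = pd.DataFrame({
    0: ["Model (cid:3634)(cid:12)NK15E", "Capacity"],
    1: ["(cid:7)Load", "1500 kg"],
})

print(strip_cid(sample))
```

Unlike the in-place loop above, this version returns a new DataFrame and leaves the original untouched, which makes it easier to compare the output before and after cleaning.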
