I'm using Camelot to extract text from PDFs. The PDFs also contain Chinese characters, for which Camelot prints CID codes instead, e.g. (cid:3634).
I want to strip out those CID codes, as the Chinese characters are not important to me.
I tried this:
>>> tables = camelot.read_pdf('D:/iolo/1. Hangcha/1. FORKLIFTS ELECTRIC/2. NK15E - 3 WHEEL - NEW-(2014)/copy.pdf',pages='12',strip_text='(cid:[0-9])')
But this only removes the "(cid:)" wrapper and not the numbers inside it.
Please help.
At the moment, Camelot's strip_text parameter doesn't support regular expressions (see the official repository). Instead, you can use the pandas replace method:
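A minimal sketch of that approach, assuming the CID codes always look like (cid:NNNN). The small DataFrame here is made-up sample data standing in for the table Camelot extracts, so the snippet runs without the PDF:

```python
import pandas as pd

# Hypothetical sample data standing in for tables[0].df
# (Camelot exposes each extracted table as a DataFrame of strings).
df = pd.DataFrame(
    {
        0: ["Rated capacity(cid:3634)(cid:1245)", "Voltage"],
        1: ["(cid:88)1500 kg", "48 V"],
    }
)

# regex=True tells pandas to treat the pattern as a regular expression
# and replace every match inside each string cell.
# \(cid:\d+\) matches a literal "(cid:", one or more digits, and ")".
cleaned = df.replace(r"\(cid:\d+\)", "", regex=True)

print(cleaned)
```

With a real extraction, the same call would be applied to the table's DataFrame, e.g. tables[0].df.replace(r"\(cid:\d+\)", "", regex=True), after calling camelot.read_pdf without the strip_text argument.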