How to parse a table in a PDF with non-English text


I used Camelot and tabula to parse a PDF file that contains Cyrillic text, but the text in the output CSV file is garbled and doesn't read as Russian.

What can help me parse a PDF table in a non-English language?

import camelot

file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages="1-end", encoding='utf-8')

Output: 00550529-1295-06 -ТКР5.СО1 0520529-12955--0066--ТТККРР55--ГГЧЧ23 00552299--11229955--0066--ТТККРР55--ГГЧЧ45


There is 1 answer

Egorsky On BEST ANSWER

So, basically, Camelot is pretty good with Cyrillic.

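# Install first, from the shell: pip install "camelot-py[cv]"
import pandas as pd
import camelot

file = 'file-name.pdf'
# Read only the pages that contain the tables (pages 4 and 5 here)
tables = camelot.read_pdf(file, pages="4,5", encoding='utf-8')
# Each detected table exposes its contents as a pandas DataFrame via .df
df_p4 = tables[0].df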

The output will be pretty raw and will need cleaning, but the symbols won't be broken, which I'd call a good result.
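
As for the cleaning step, here is a rough sketch of what it could look like. It is not part of Camelot: clean_rows and write_csv are hypothetical helper names, and raw_rows is made-up sample data standing in for something like tables[0].df.values.tolist(). The 'utf-8-sig' encoding is an assumption about one common cause of "broken" Cyrillic, namely a spreadsheet program opening a UTF-8 CSV without a byte order mark; it will not help if the text is already wrong when extracted from the PDF.

```python
import csv
import re


def clean_rows(rows):
    """Normalize whitespace in each cell and drop rows that end up empty."""
    cleaned = []
    for row in rows:
        # Collapse newlines and repeated spaces that PDF extraction tends to leave in cells.
        cells = [re.sub(r"\s+", " ", str(cell)).strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def write_csv(rows, path):
    """Write rows as CSV; 'utf-8-sig' prepends a BOM so spreadsheet apps detect UTF-8."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(rows)


# Hypothetical sample data standing in for tables[0].df.values.tolist()
raw_rows = [
    ["  Обозначение ", "Наименование\nдокумента"],
    ["", "   "],
    ["Пример-01", " Спецификация  оборудования "],
]

if __name__ == "__main__":
    rows = clean_rows(raw_rows)
    write_csv(rows, "table_page4.csv")
    print(rows)
```

With a real table you would pass df_p4.values.tolist() instead of raw_rows. If you prefer to stay in pandas, df_p4.to_csv('table_page4.csv', encoding='utf-8-sig', index=False) writes the same kind of file without the whitespace cleanup.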