How to parse table in PDF for non-english language

Asked by Egorsky on 12 December 2020 at 20:29

I was using Camelot and tabula to parse a PDF file that contains Cyrillic text. In the output CSV file, the text came out garbled, with no recognizable Russian.

What can help me parse a PDF table in a non-English language?

import camelot

file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages="1-end", encoding='utf-8')

Output: 00550529-1295-06 -РўРљР 5.РЎРћ1 0520529-12955--0066--РўРўРљРљР Р 55--Р“Р“Р§Р§23 00552299--11229955--0066--РўРўРљРљР Р 55--Р“Р“Р§Р§45
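The garbled output above has the pattern of UTF-8 bytes displayed as Windows-1251 (for example, "РўРљР" is what "ТКР" looks like under that mix-up). That is an assumption based on the pattern alone; the question does not say which program opened the CSV. A minimal sketch of how the effect arises and how it reverses, using a sample string rather than the real file:

```python
# Sample Cyrillic text (illustrative; not taken from the original PDF).
original = "ТКР"

# Encode as UTF-8, then decode as Windows-1251: this reproduces the garbling.
garbled = original.encode("utf-8").decode("cp1251")
print(repr(garbled))

# Reversing the two steps recovers the text, provided no bytes were lost.
restored = garbled.encode("cp1251").decode("utf-8")
print(restored)
```

If this is the cause, the extracted data is intact and only the encoding chosen by the viewer is wrong.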

Original Q&A

There is 1 answer

**Egorsky** · Accepted Answer · 2020-12-13T12:00:01+00:00

It turns out Camelot handles Cyrillic well.

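# Install first (shell command, not Python): pip install "camelot-py[cv]"
import pandas as pd
import camelot

file = 'file-name.pdf'
# Extract the tables found on pages 4 and 5
tables = camelot.read_pdf(file, pages="4,5", encoding='utf-8')
# Each extracted table exposes its contents as a pandas DataFrame
df_p4 = tables[0].df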

The output is raw and needs cleaning, but the characters are not broken, which I consider a good result.
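If the characters survive extraction but still look broken when the CSV is opened in a spreadsheet program, writing the file with a UTF-8 byte order mark may help. This is a sketch using only the standard library; the rows and the file name are hypothetical stand-ins for cleaned table cells, and whether the viewer honours the mark is an assumption. With a pandas DataFrame, the same codec name can be passed as the `encoding` argument of `DataFrame.to_csv`.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for cleaned table cells (not from the original PDF).
rows = [["Код", "Название"], ["ТКР5", "Пример"]]

# Hypothetical output path in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "table.csv")

# "utf-8-sig" writes a byte order mark before the data, which many
# spreadsheet programs use as a hint that the file is UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes to confirm the mark is present.
with open(path, "rb") as f:
    raw = f.read()

# Reading back with the same codec strips the mark again.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```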
