Tabula (and pdfplumber) unable to accurately extract Thai characters from a text-based PDF

I used tabula to extract Thai characters from a PDF. It is a text-based form with the following text: XXXX

After running the code below:

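import tabula
import pandas as pd

# read_pdf returns a list of DataFrames, one per detected table
pdlist = tabula.read_pdf(r'subscription.PDF', pandas_options={'header': None}, pages="all", stream=True, encoding='UTF-8')
# combine all tables into a single DataFrame
result_df = pd.concat(pdlist)
print(result_df)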

I saw from the print output (and the Excel export) that some characters are consistently wrong. For example, the highlighted characters below are missing some parts compared with the original PDF.

PDF original: [image]

print(result_df) output: [image]

Excel export: [image]

I tested with pdfplumber and faced the same issue, but with other characters. Can anyone with experience in Thai (or other Asian languages) help with this? Is this something I need to accept, or is there some tuning or code I can add to improve the extraction?
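
One way to narrow this down is to check whether the missing parts are really absent from the extracted string or are only displayed incorrectly by the console or Excel font. The following is a minimal sketch using only the standard library, assuming the missing parts are Thai combining marks (the vowels and tone marks drawn above or below a consonant). The helper names `describe_chars` and `count_thai_marks` and the sample strings are hypothetical and are not part of tabula or pdfplumber.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name, category) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for ch in text
    ]


def count_thai_marks(text):
    """Count nonspacing marks (category Mn) inside the Thai block U+0E00..U+0E7F."""
    return sum(
        1
        for ch in text
        if "\u0e00" <= ch <= "\u0e7f" and unicodedata.category(ch) == "Mn"
    )


# Hypothetical sample data: 'expected' is typed by hand from the PDF,
# 'extracted' stands in for a cell value copied from result_df.
expected = "\u0e17\u0e35\u0e48"   # THO THAHAN + SARA II + MAI EK
extracted = "\u0e17\u0e35"        # same syllable with the tone mark lost

for row in describe_chars(extracted):
    print(row)

print("marks expected:", count_thai_marks(expected))
print("marks extracted:", count_thai_marks(extracted))
```

If the mark counts match but the text still looks wrong on screen, the problem is more likely in rendering than in extraction. If the counts differ, the marks were lost or changed before the text reached pandas.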
