TechQA.

Tabula (and pdfplumber) unable to accurately extract Thai characters from text-based PDF

69 views Asked by theluncheonmeat At 02 December 2023 at 10:27

I used tabula to extract Thai characters from a PDF. It is a text-based form with the following text: XXXX

After running the code below:

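import tabula
import pandas as pd

# read_pdf returns one DataFrame per detected table; header=None keeps the first row as data
pdlist = tabula.read_pdf(r'subscription.PDF', pandas_options={'header': None}, pages="all", stream=True, encoding='UTF-8')
result_df = pd.concat(pdlist)  # combine the tables into a single DataFrame
print(result_df)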

In the print output (and the Excel export), some characters are consistently wrong. For example, these highlighted characters are missing some parts compared with the original PDF.

[Screenshots: PDF original text; print(result_df) output; Excel export]

I tested with pdfplumber and ran into the same issue, though with different characters. Can anyone with experience in Thai (or other Asian languages) help with this? Is this something I need to accept, or is there some tuning or extra code that would improve the extraction?
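To pin down exactly which characters are being lost, one option is to compare code points between the expected text and what the extractor returned. The following is only a minimal sketch using the standard library: the helper names `describe` and `missing_chars` are made up for illustration, and the two sample strings are hypothetical, not taken from the PDF in question. The sample assumes the kind of loss where SARA AM (U+0E33) comes back as SARA AA (U+0E32), which may or may not be what is happening here.

```python
import unicodedata
from collections import Counter


def describe(text):
    """Return one (code point, Unicode name) pair per character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def missing_chars(expected, extracted):
    """Return the characters that occur more often in expected than in extracted."""
    diff = Counter(expected) - Counter(extracted)
    return sorted(diff.elements())


# Hypothetical sample strings (assumptions for illustration only)
expected = "\u0e19\u0e49\u0e33"   # NO NU, MAI THO, SARA AM
extracted = "\u0e19\u0e49\u0e32"  # NO NU, MAI THO, SARA AA

# Print each character of the extracted text with its Unicode name
for code_point, name in describe(extracted):
    print(code_point, name)

# Print what the expected text has that the extracted text lacks
for code_point, name in describe("".join(missing_chars(expected, extracted))):
    print("missing:", code_point, name)
```

In practice, `expected` would be a short string typed by hand from the PDF and `extracted` the matching cell from `result_df`. If the missing characters turn out to be combining vowels or tone marks, that would suggest the marks are dropped during extraction rather than corrupted by the encoding.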

There are 0 answers