I used tabula to extract Thai text from a PDF. It is a text-based form containing the following text: XXXX
After running the code below:
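import tabula
import pandas as pd
# Read every page in stream mode; header=None keeps the first row as data
pdlist = tabula.read_pdf(r'subscription.PDF', pandas_options={'header': None}, pages="all", stream=True, encoding='UTF-8')
# read_pdf returns a list of DataFrames (one per detected table), so combine them
result_df = pd.concat(pdlist)
print(result_df)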
In the printed output (and in the Excel export), some characters are consistently wrong. For example, the highlighted characters are missing parts compared with the original PDF.
I tested with pdfplumber and hit the same issue, but with different characters. Can anyone with experience in Thai (or other Asian languages) help with this? Is this something I need to accept, or is there some tuning or code I can add to improve the result?
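To narrow this down, here is a minimal sketch of how the extracted text could be compared against a hand-typed reference. It assumes that the "missing parts" are Thai combining marks (above/below vowels and tone marks, Unicode category Mn), which I have not confirmed. The function names and the two sample strings are hypothetical and are not taken from the actual PDF.

```python
import unicodedata
from collections import Counter

def combining_marks(text):
    """Count the nonspacing marks (Unicode category Mn) found in text."""
    return Counter(ch for ch in text if unicodedata.category(ch) == "Mn")

def missing_marks(expected, extracted):
    """List marks that occur more often in expected than in extracted."""
    # Counter subtraction keeps only positive counts
    diff = combining_marks(expected) - combining_marks(extracted)
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in sorted(diff.elements())
    ]

# Hypothetical samples: a word typed by hand from the PDF,
# and the same word as an extractor might return it without its tone mark
expected = "\u0e19\u0e49\u0e33"
extracted = "\u0e19\u0e33"
print(missing_marks(expected, extracted))
```

If the list is non-empty for the affected cells, the marks are being dropped during extraction rather than only being rendered badly by the console or Excel font.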