Tesseract in Unstructured not recognizing Greek characters in mixed-language PDF


I am working with PDFs whose tables mix Greek and Latin characters (e.g., chemical names like α-pinene). I am using Tesseract OCR to extract the text, but it only seems to recognize the Latin letters, even though I have installed all the necessary language packages from tesseract-lang.
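To rule out a missing language pack, the installed Tesseract languages can be listed and compared against the codes being requested. The following is a minimal sketch, assuming the tesseract binary is on the PATH; the helper names (parse_list_langs, installed_tesseract_langs, missing_langs) are hypothetical and not part of any library.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def installed_tesseract_langs() -> set[str]:
    """Ask the local tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    # Any warning lines on stderr would be picked up as extra "codes", which
    # does not affect the missing-language check below.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def missing_langs(required, available) -> list[str]:
    """Return the required language codes that are not installed, sorted."""
    return sorted(set(required) - set(available))
```

Calling missing_langs(["eng", "ell"], installed_tesseract_langs()) should return an empty list when both the English and Greek packs are visible to Tesseract; if "ell" is reported as missing, the problem lies in the installation rather than in the partition_pdf call.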

When I inspect the tables using table.metadata.text_as_html, the Greek letters are either missing or replaced with Latin ones. I suspect I am passing the arguments incorrectly. Here's a snippet of my code:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, languages=["eng", "ell"], strategy="auto", infer_table_structure=True, url=None, model_name="yolox")

partition_pdf is a function from the unstructured library (unstructured.io). I have tried passing different values to the languages argument, but the issue persists. Can anyone help me identify what I am doing wrong, or suggest a way to extract both Greek and English characters from the PDF correctly?
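To make the symptom easier to reproduce, the table HTML can be scanned for Greek letters instead of being inspected by eye. This is a minimal standard-library sketch; greek_letters is a hypothetical helper name, and the loop shown in the trailing comment assumes each element exposes category and metadata.text_as_html as in the snippet above.

```python
import unicodedata


def greek_letters(text: str) -> list[str]:
    """Return every Greek letter found in text, in order of appearance."""
    return [
        ch
        for ch in text
        # unicodedata.name() yields e.g. "GREEK SMALL LETTER ALPHA" for "α";
        # the "" default covers characters that have no Unicode name.
        if ch.isalpha() and "GREEK" in unicodedata.name(ch, "")
    ]


# Assumed usage against the partition_pdf output (not run here):
# for el in elements:
#     if el.category == "Table":
#         print(greek_letters(el.metadata.text_as_html or ""))
```

An empty list for a table that visibly contains α or β in the PDF would confirm that the Greek characters are being lost during extraction, rather than when the HTML is displayed.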
