Tesseract in Unstructured not recognizing Greek characters in mixed-language PDF

412 views Asked by marialagerholm At 06 October 2023 at 09:27

I am working with PDFs that contain tables with a mix of Greek characters and English letters (e.g., chemical formulas like α-pinene). I am using Tesseract OCR to extract the text, but it seems to only recognize the English letters, even though I have installed all the necessary language packages from tesseract-lang.

When I inspect the tables using table.metadata.text_as_html, the Greek letters are either missing or replaced with English ones. I suspect there might be a syntax error in my code. Here's a snippet of my code:

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, languages=["eng", "ell"], strategy="auto", infer_table_structure=True, url=None, model_name="yolox")
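To make the symptom measurable rather than eyeballing the HTML, a small check like the following could count the Greek characters that actually survive in each table. This is only a sketch: `greek_chars` is a hypothetical helper name, and the sample HTML string is made up for illustration. It tests for code points in the Unicode "Greek and Coptic" and "Greek Extended" blocks.

```python
def greek_chars(text):
    """Return the characters from the Unicode Greek blocks found in text, in order."""
    return [
        ch for ch in text
        if "\u0370" <= ch <= "\u03ff"   # Greek and Coptic block
        or "\u1f00" <= ch <= "\u1fff"   # Greek Extended block
    ]


# Made-up sample standing in for table.metadata.text_as_html
sample_html = "<table><tr><td>α-pinene</td><td>β-pinene</td></tr></table>"
print(greek_chars(sample_html))  # ['α', 'β']
```

Running the same function over `table.metadata.text_as_html` for each table element would show whether the Greek letters are dropped entirely (empty list) or only partly recognized.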

partition_pdf is a function from the unstructured.io library. I have tried passing different values to the languages argument, but the issue persists. Can anyone help me identify what I might be doing wrong, or suggest a way to correctly extract both Greek and English characters from the PDF?
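One thing worth ruling out first is whether the Tesseract binary that Python ends up calling can actually see the Greek model (`ell` is Tesseract's code for modern Greek; `grc` is ancient Greek). The sketch below shells out to `tesseract --list-langs` and reports which requested codes are missing. The helper names are hypothetical, and it assumes the `tesseract` executable on PATH is the same one Unstructured uses, which may not hold if several installations exist.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages (3):'
    return {line for line in lines if not line.lower().startswith("list of")}


def installed_tesseract_langs():
    """Return the language codes reported by the tesseract binary on PATH (empty set if none)."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True, check=False)
    # Assumption: some builds print the list to stderr, so fall back to it if stdout is empty.
    return parse_list_langs(result.stdout or result.stderr)


def missing_langs(wanted, installed):
    """Return the requested codes that are not installed, preserving order."""
    return [code for code in wanted if code not in installed]


if __name__ == "__main__":
    wanted = ["eng", "ell"]
    print("missing:", missing_langs(wanted, installed_tesseract_langs()))
```

If `ell` shows up as missing here, the language data is not visible to this Tesseract installation, and the `languages` argument cannot take effect no matter how it is written.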


There are 0 answers
