I am working with PDFs that contain tables with a mix of Greek characters and English letters (e.g., chemical formulas like α-pinene). I am using Tesseract OCR to extract the text, but it seems to only recognize the English letters, even though I have installed all the necessary language packages from tesseract-lang.
When I inspect the tables using table.metadata.text_as_html, the Greek letters are either missing or replaced with similar-looking English ones. The code runs without errors, so I suspect I am calling the function incorrectly. Here's a snippet of my code:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, languages=["eng", "ell"], strategy="auto", infer_table_structure=True, url=None, model_name="yolox")
partition_pdf is a function from the unstructured library (unstructured.io). I have tried passing different values to the languages argument, but the issue persists. Can anyone help me identify what I might be doing wrong, or suggest a way to correctly extract both Greek and English characters from the PDF?
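To narrow down where the Greek letters get lost, a check like the one below could be run on each table's text_as_html string. It is only a minimal sketch using the Python standard library; the helper names html_to_text and greek_chars are hypothetical and not part of unstructured, and the sample HTML is made up for illustration.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def greek_chars(text):
    """Return every character whose Unicode name starts with 'GREEK'."""
    return [ch for ch in text if unicodedata.name(ch, "").startswith("GREEK")]


# Made-up sample: one cell kept the alpha, the other lost it to a Latin "a".
sample_html = "<table><tr><td>α-pinene</td><td>a-pinene</td></tr></table>"
found = greek_chars(html_to_text(sample_html))
print(found)
```

If this returns an empty list for a table that visibly contains Greek letters in the PDF, the characters were lost during OCR rather than when the HTML was built.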