How to use tessdata_best with Tesseract (pytesseract): what are the arguments and procedure?


TL;DR: How do I install tessdata_best to use with pytesseract inside conda on Ubuntu 18?

I have been using pytesseract inside a conda environment for quite some time, but I need to improve the accuracy, and I found out that tessdata_best gives the best accuracy. How can I install and use that version? I am using Ubuntu 18 and have to work with pytesseract.

Tesseract is installed at /usr/share/tesseract-ocr/, and inside it there is only one tessdata directory.

Do I need to download tessdata_best from GitHub and copy it into /usr/share/tesseract-ocr/ alongside tessdata?

Even then, if I want to use tessdata_best, what do I have to pass? Do I need to change the config to --oem 0/1/2/3?

Third and last, my language .traineddata files are at /home/deshwal/anaconda3/envs/py36/share/tessdata/eng.traineddata. Do I need to paste tessdata_best at this location too? Because when I try to change the language directory, I get this error:

/home/deshwal/anaconda3/envs/py36/share/tessdata/equ.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'equ' Tesseract couldn't load any languages! Could not initialize tesseract.


There are 2 answers

Maulik Kayastha On

I don't know if I understand your question correctly, but let me know if the following helps. You need to set the data path to the location where you copy the tessdata_best trained models. For example:

Tesseract tesseract = new Tesseract(); // JNA Interface Mapping
tesseract.setDatapath("/home/tesseract/tessdata_best_4_0_0/tessdata");

All the .traineddata files you downloaded from https://github.com/tesseract-ocr/tessdata_best should be placed in the directory you pass to setDatapath (for example, /home/tesseract/tessdata_best_4_0_0/tessdata).
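For pytesseract users, fetching the files and checking that they are in place could be sketched in Python roughly as follows. This is only a sketch: the function names are made up for illustration, and the download URL assumes the models are served from the main branch of the tessdata_best repository, so adjust it if the repository layout differs.

```python
import os
import urllib.request

# Assumption: raw model files are served from the "main" branch of the repo.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main"


def traineddata_url(lang):
    """Return the (assumed) download URL for one language model."""
    return f"{BASE_URL}/{lang}.traineddata"


def missing_languages(tessdata_dir, langs):
    """List the languages whose .traineddata file is absent from tessdata_dir."""
    return [
        lang
        for lang in langs
        if not os.path.isfile(os.path.join(tessdata_dir, f"{lang}.traineddata"))
    ]


def fetch_missing(tessdata_dir, langs):
    """Download every missing model into tessdata_dir (needs network access)."""
    os.makedirs(tessdata_dir, exist_ok=True)
    for lang in missing_languages(tessdata_dir, langs):
        target = os.path.join(tessdata_dir, f"{lang}.traineddata")
        urllib.request.urlretrieve(traineddata_url(lang), target)
```

Calling fetch_missing with a directory of your choice and a list such as ["eng"] would leave eng.traineddata in that directory, and missing_languages can then be used to confirm that nothing is absent before running OCR.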

Please note: these models only work with the LSTM OCR engine of Tesseract 4, so make sure you are using version 4.1 or above of the library.
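On the --oem question, here is a small Python sketch of how the engine mode could be chosen. The mode numbers are the ones the tesseract command line documents for --oem. The helper name and the set of modes treated as compatible are my own assumptions: since tessdata_best contains LSTM models only, the sketch rejects the modes that require the legacy engine.

```python
# Engine modes accepted by the tesseract CLI's --oem flag.
OEM_MODES = {
    0: "legacy engine only",
    1: "neural net LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}

# Assumption: with LSTM-only models, only modes 1 and 3 are usable.
LSTM_ONLY_COMPATIBLE = {1, 3}


def oem_config(oem=1):
    """Return the --oem flag for an LSTM-only model set such as tessdata_best."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    if oem not in LSTM_ONLY_COMPATIBLE:
        raise ValueError(f"mode {oem} ({OEM_MODES[oem]}) needs legacy model data")
    return f"--oem {oem}"
```

The returned string can be appended to whatever config string is passed to pytesseract, so the default call yields "--oem 1".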

Regards, Maulik

Rachid Benouini On

According to the pytesseract documentation, you can pass --tessdata-dir through the config argument, as follows:

import pytesseract
from PIL import Image

# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
image = Image.open('<replace_with_your_image_path>')
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)

For more details see https://pypi.org/project/pytesseract/.
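Putting this together for the setup in the question, a minimal sketch might look like the following. The helper names are hypothetical, and it assumes the tessdata_best files were copied to a directory of your choosing. The lazy imports are there only so that the config helper can be used without pytesseract installed.

```python
def tessdata_config(tessdata_dir, extra=""):
    """Build a pytesseract config string that points at tessdata_dir."""
    # Double quotes keep a path containing spaces together as one argument.
    config = f'--tessdata-dir "{tessdata_dir}"'
    return f"{config} {extra}".strip()


def ocr_with_best(image_path, tessdata_dir, lang="eng"):
    """Run OCR with the models in tessdata_dir.

    Needs pytesseract, Pillow and the tesseract binary to be installed.
    """
    import pytesseract  # imported lazily; only needed when OCR is run
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image,
            lang=lang,
            # --oem 1 selects the LSTM engine.
            config=tessdata_config(tessdata_dir, "--oem 1"),
        )
```

For example, ocr_with_best("page.png", "/home/deshwal/tessdata_best") would read eng.traineddata from that directory, where both the image name and the directory are placeholders.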