How to change the tesseract config to recognize § and apply it with pdftools::pdf_ocr_text in R?

I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is the § character, which tesseract does not recognize.

I looked at the CRAN tesseract package vignette, an SO question about a similar problem, and a related GitHub page.

And I tried the following:

  1. I found the configuration files using tesseract_info() and edited the digits file under configs. The digits file content was like this:

    tessedit_char_whitelist 0123456789.

After editing it looks like this:

    tessedit_char_whitelist 0123456789-$§.

This did not change anything at all. I am still not able to extract §; each one still appears as 8.

  2. After the first step failed, I tried the following:

    filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)
    
    specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))
    
    text <- tesseract::ocr(filepng, engine = specs)
    
    

This one failed too. I am by no means an expert on OCR, and tesseract has room for improvement when it comes to documentation.

How can I add § to the list of characters to be recognized in the right way, so that it applies?

Update

The following works to recognize §, when I remove language from the argument list:

    charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))

    text <- tesseract::ocr(filepng, engine = charlist)

But this time, I am losing the German umlauts. I cannot figure out how to specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts both a language argument and an options argument, but combining them does not seem to work. Any ideas?

Update: I tried using tesseract on the command line (macOS Catalina 10.15.7).

I first converted a scanned PDF file to an image, then used this:

    tesseract fileConverted.tiff fileToText

It creates fileToText.txt, and every § is recognized correctly. But the German umlauts are not recognized correctly, since I did not specify a language at all. When I use the same command with the language argument

    tesseract fileConverted.tiff fileToText -l deu

the German umlauts are recognized properly but § is not.

The digits config file I changed is here:

    /usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs

My understanding is that this is not a problem specific to R; it occurs with tesseract itself. Setting tessedit_char_whitelist and the language at the same time does not seem to be possible, or I am missing something important.

1 answer

Answer by Henry Mont (best answer)

Tesseract 4 does not support setting a whitelist with its default LSTM engine. To work around that problem, you can use command-line switches: set the OCR Engine mode to "Original Tesseract only" with --oem 0, then use -c tessedit_char_whitelist=abc... to pass your whitelist directly on the command line.

Overall, it should look something like this (the whitelist is single-quoted so that the shell leaves $ alone):

    tesseract fileConverted.tiff fileToText --oem 0 -l deu -c 'tessedit_char_whitelist=0123456789-$§'
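
A digits-only whitelist would also drop every letter, so for German text the fuller whitelist from the question's R attempt is probably closer to what is wanted. The following is a minimal sketch rather than a verified fix. It assumes the file names from the question, and it assumes that the installed deu.traineddata contains the legacy ("Original Tesseract") model. Language files from the tessdata_fast and tessdata_best sets are LSTM-only, so --oem 0 fails with them. The script builds the command, prints it, and runs it only when tesseract and the input image are present.

```shell
#!/bin/sh
# Sketch: OCR with the legacy engine (--oem 0), German language data, and a
# whitelist that keeps German letters, umlauts, and the section sign.

input="fileConverted.tiff"   # image name taken from the question
output="fileToText"          # tesseract appends .txt to this base name

# Whitelist copied from the question's first R attempt.
# Single quotes keep $, &, ! and ( ) literal in the shell.
whitelist='1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+'

# Store the arguments as positional parameters so the whitelist stays a
# single argument even though it contains shell metacharacters.
set -- "$input" "$output" --oem 0 -l deu -c "tessedit_char_whitelist=$whitelist"

# Show the command that would be run.
echo "tesseract $*"

# Run only when tesseract is installed and the input image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
  tesseract "$@"
else
  echo "skipping OCR: tesseract or $input not found" >&2
fi
```

If tesseract reports that the legacy engine cannot be loaded, the likely cause is LSTM-only language data; in that case a deu.traineddata that includes the legacy model is needed before --oem 0 can work.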