Getting error Invalid memory access when use Tesseract (Tess4j) in JAVA project Netbeans

Question

Getting error Invalid memory access when use Tesseract (Tess4j) in JAVA project Netbeans

723 views Asked by Hérick Raposo At 01 June 2022 at 13:39

I'm developing a routine in my JAVA Web project (Using netbeans 13) where I extract the texts from a pdf. If it doesn't find a certain term, it converts the pdf to image and tries to extract the text with OCR tesseract. After several attempts I always got the same error regardless of the configuration I did. Following error:

Error opening data file tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!

Caused by: java.lang.Error: Invalid memory access

I'm having difficulties configuring the library in my development environment because, unlike the tutorials where SpringBoot and Eclipse are used, I use Netbeans 13 and I have the following project structure:

First I added the dependency in pom.xml:

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>5.2.0</version>
        </dependency>

After that as indicated in some tutorials, I went to my tessdata folder which is in dependencies/Tes4j and copied the tessdata folder to the web-inf folder (I also tried to the resources folder).

Later I tried to configure the environment variable TESSDATA_PREFIX. I couldn't find information if this variable was to be defined in the windows system variables or if there is some other place in netbeans to define this. In my project structure I tried three definitions but none worked:

C:\Programacao\myProjectName\Tess4J
C:\Programacao\myProjectName\src\main\webapp\WEB-INF
C:\Programacao\SPE-V2\src\main\webapp\resources

In the code of my method I tried to set the datapath passing only the name tessdata, data and I also tried to point out the paths above. Follow created method:

 public String extractText(Anexos anexo) throws Exception {
        File file = new File(anexo.getCaminho());
        PDDocument doc = PDDocument.load(file);

        System.out.println("=================================> Extraindo com pdfBox <=========================");
        PDFTextStripper estripador = new PDFTextStripper();
        estripador.setSortByPosition(false); 
        String pdfTexto = estripador.getText(doc);
        String line = "";

        line = pdfTexto.toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                .replaceAll("\\s", " ");
        
        if (line.contains("sped")) {
            return line;
        } else {

            PDFRenderer pdfRenderer = new PDFRenderer(doc);
            StringBuilder out = new StringBuilder();

            Tesseract tesseract = new Tesseract();
            tesseract.setLanguage("por");
            tesseract.setOcrEngineMode(1);

            Path dataDirectory = Paths.get("tessdata");
            tesseract.setDatapath(dataDirectory.toString());

            for (int page = 0; page < doc.getNumberOfPages(); page++) {
                BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

                // Create a temp image file
                File tempFile = new File(file.getPath().replace(File.separator + anexo.getAnexo(), "") + File.separator + "tempfile_" + anexo.getAnexo().replaceAll("\\..*", "") + "_" + page + ".png");
                ImageIO.write(bufferedImage, "png", tempFile);
                String result = tesseract.doOCR(tempFile);
                out.append(result);

                // Delete temp file
                tempFile.delete();
            }

            line = out.toString().toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                    .replaceAll("\\s", " ");
        }

        return line;
    }

My tessdata folder:

My development environment:

Netbeans:13

JDK 15

tess4j:5.2.0

My doubts:

So I would like to know if this environment variable () is configured in windows variables, in netbeans or in some internal place in my code? Also, did I skip any steps? Do I need to download anything else? Please help me, I don't know what else to do!

Original Q&A

There are 1 answers

**nguyenq** · Answer 1 · 2022-06-05T12:40:06+00:00

nguyenq On 05 June 2022 at 12:40

It's likely you're using a language pack not compatible with the specified OcrEngineMode. So either use a compatible one (https://github.com/tesseract-ocr/) or use an OcrEngineMode compatible with the language pack.

TechQA.

Getting error Invalid memory access when use Tesseract (Tess4j) in JAVA project Netbeans

There are 1 answers

Related Questions in JAVA

Related Questions in TESSERACT

Related Questions in TESS4J

Popular Questions

Trending Questions