I'm developing a routine in my JAVA Web project (Using netbeans 13) where I extract the texts from a pdf. If it doesn't find a certain term, it converts the pdf to image and tries to extract the text with OCR tesseract. After several attempts I always got the same error regardless of the configuration I did. Following error:
Error opening data file tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!
Caused by: java.lang.Error: Invalid memory access
I'm having difficulties configuring the library in my development environment because, unlike the tutorials where SpringBoot and Eclipse are used, I use Netbeans 13 and I have the following project structure:
First I added the dependency in pom.xml:
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>5.2.0</version>
</dependency>
After that as indicated in some tutorials, I went to my tessdata folder which is in dependencies/Tes4j and copied the tessdata folder to the web-inf folder (I also tried to the resources folder).
Later I tried to configure the environment variable TESSDATA_PREFIX. I couldn't find information if this variable was to be defined in the windows system variables or if there is some other place in netbeans to define this. In my project structure I tried three definitions but none worked:
- C:\Programacao\myProjectName\Tess4J
- C:\Programacao\myProjectName\src\main\webapp\WEB-INF
- C:\Programacao\SPE-V2\src\main\webapp\resources
In the code of my method I tried to set the datapath passing only the name tessdata, data and I also tried to point out the paths above. Follow created method:
public String extractText(Anexos anexo) throws Exception {
File file = new File(anexo.getCaminho());
PDDocument doc = PDDocument.load(file);
System.out.println("=================================> Extraindo com pdfBox <=========================");
PDFTextStripper estripador = new PDFTextStripper();
estripador.setSortByPosition(false);
String pdfTexto = estripador.getText(doc);
String line = "";
line = pdfTexto.toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
.replaceAll("\\s", " ");
if (line.contains("sped")) {
return line;
} else {
PDFRenderer pdfRenderer = new PDFRenderer(doc);
StringBuilder out = new StringBuilder();
Tesseract tesseract = new Tesseract();
tesseract.setLanguage("por");
tesseract.setOcrEngineMode(1);
Path dataDirectory = Paths.get("tessdata");
tesseract.setDatapath(dataDirectory.toString());
for (int page = 0; page < doc.getNumberOfPages(); page++) {
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
// Create a temp image file
File tempFile = new File(file.getPath().replace(File.separator + anexo.getAnexo(), "") + File.separator + "tempfile_" + anexo.getAnexo().replaceAll("\\..*", "") + "_" + page + ".png");
ImageIO.write(bufferedImage, "png", tempFile);
String result = tesseract.doOCR(tempFile);
out.append(result);
// Delete temp file
tempFile.delete();
}
line = out.toString().toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
.replaceAll("\\s", " ");
}
return line;
}
My tessdata folder:
My development environment:
Netbeans:13
JDK 15
tess4j:5.2.0
My doubts:
So I would like to know if this environment variable () is configured in windows variables, in netbeans or in some internal place in my code? Also, did I skip any steps? Do I need to download anything else? Please help me, I don't know what else to do!


It's likely you're using a language pack not compatible with the specified OcrEngineMode. So either use a compatible one (https://github.com/tesseract-ocr/) or use an OcrEngineMode compatible with the language pack.