OCR files on Alfresco using pypdfocr


I can't OCR files on Alfresco using pypdfocr.

Hello everyone, I'm new to Alfresco and I'm having difficulty configuring and using pypdfocr with it.

I installed Alfresco on Ubuntu 18.04.5 LTS, using this installer:

wget https://download.alfresco.com/release/community/201707-build-00028/alfresco-community-installer-201707-linux-x64.bin

I have already done the necessary configuration and added the repo and share JAR files to their respective folders:

/opt/alfresco-community/modules/platform/simple-ocr-repo-2.3.1.jar
/opt/alfresco-community/modules/share/simple-ocr-share-2.3.1.jar

I added these properties to alfresco-global.properties:

# PYPDFOCR
ocr.command = /opt/alfresco-community/scripts/ocr.sh
ocr.output.verbose = true
ocr.output.file.prefix.command =

ocr.extra.commands = -v -l por
ocr.server.os = linux

I created the script referenced by ocr.command above:

#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging / troubleshooting
array=("$@")
unset "array[${#array[@]}-1]"
/usr/local/bin/pypdfocr "${array[@]}"
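
For readers unfamiliar with the array handling in the wrapper script above, here is a minimal sketch of what it appears to do: copy the arguments into an array, drop the last one, and pass the rest on (presumably because pypdfocr chooses its own output file name). The function name `drop_last_arg` is hypothetical and exists only for illustration; it prints the remaining arguments instead of calling pypdfocr, and it adds a guard for an empty argument list that the original script does not have.

```shell
#!/usr/bin/env bash
# Illustrative only: drop_last_arg is a hypothetical helper, not part of simple-ocr.
drop_last_arg() {
    local args=("$@")
    # Guard against an empty argument list, where the index would be -1.
    if [ "${#args[@]}" -gt 0 ]; then
        unset "args[${#args[@]}-1]"
    fi
    # Print the remaining arguments, one per line.
    if [ "${#args[@]}" -gt 0 ]; then
        printf '%s\n' "${args[@]}"
    fi
}

# Example call with placeholder file names; the last argument is dropped.
drop_last_arg -v -l por input.pdf output.pdf
```

Uncommenting `set -o xtrace` in the real script should show the final pypdfocr command line in the same way.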

I installed the dependencies like this: apt install gcc libjpeg-dev minizip zlib1g-dev python-dev

However, when I try to perform OCR inside Alfresco, I am getting the following message in /tomcat/logs/:

catalina.out

Any help will be appreciated.

Update: I tried to solve this by installing more dependencies, but it didn't work:

apt-get install wget gcc gcc-c++ make autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel ocaml ImageMagick ImageMagick-devel

I get the following message:

E: Unable to locate package gcc-c++
E: Couldn't find any package by regex 'gcc-c++'
E: Unable to locate package libjpeg-devel
E: Unable to locate package libpng-devel
E: Unable to locate package libtiff-devel
E: Unable to locate package zlib-devel
E: Unable to locate package ImageMagick
E: Unable to locate package ImageMagick-devel

There is 1 answer

Curtis (BEST ANSWER)

It would appear that ImageMagick and/or poppler-utils need to be installed.

To install ImageMagick: https://www.tutorialspoint.com/how-to-install-imagemagick-on-ubuntu
To install poppler-utils: sudo apt-get install -y poppler-utils

NOTE: you'll need quite a few more dependencies to get this OCR module to work. Specifically, the following:

Tesseract and Leptonica: https://medium.com/@jjagadish.in/install-tesseract-3-04-on-centos-7-4573465d8867

As well as the following packages:

epel-release
python-pip
gcc
libjpeg
minizip
zlib
python
ghostscript

Once you install pip, you'll need to install pypdfocr and pyyaml:

pip install pypdfocr
pip install pyyaml
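
Before retrying inside Alfresco, it may help to confirm that the command-line tools these packages provide are actually on the PATH. The following is a minimal sketch, not a definitive check: `missing_tools` is a hypothetical helper, and the binary names (`tesseract`, `gs` for Ghostscript, `convert` for ImageMagick, `pdftoppm` for poppler-utils, and `pypdfocr`) are assumptions about what a typical installation provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Binary names below are assumptions; adjust them to match your installation.
missing_tools tesseract gs convert pdftoppm pypdfocr
```

An empty result suggests all of the tools were found. Running the check as the same user that runs Tomcat is probably the more telling test, since that user's PATH may differ from yours.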

I would suggest getting it to work at the command line first, using an example PDF:

/opt/alfresco-community/scripts/ocr.sh -v -l por test.pdf test.pdf
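
If you want a quick pass/fail result from that command-line test, a small wrapper along these lines might help. It is only a sketch: `run_and_report` is a hypothetical helper, and `test.pdf` is a sample file you supply yourself. It hides the command's output and reports the exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, discard its output, and report the exit status.
run_and_report() {
    local rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAILED (exit $rc): $*"
    fi
}

# Path and arguments are taken from the example above.
run_and_report /opt/alfresco-community/scripts/ocr.sh -v -l por test.pdf test.pdf
```

When the result is FAILED, rerun the command without the wrapper to see what pypdfocr itself prints.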