Arabic PDF text extraction


I'm trying to extract text from Arabic PDFs (raw text extraction, not OCR).

I tried many packages and tools, including Python packages, PDFBox, the Adobe API, and many others, and all of them failed to extract the text correctly: either the text is read LTR or it is decoded incorrectly.

Here are two samples from different tools:
sample 1:

املحتويات

7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧

sample 2:

ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧

Original text: [image]. And yes, I can copy it from the PDF and get the same text as rendered.

Is there any tool that can extract Arabic text correctly?

The book link can be found here.


There are 3 answers

3
K J On BEST ANSWER

The text stored in a PDF is not the same as the text used to construct it. We can see that in your example, where page number 7 is shown in Arabic on the page but is coded as 7 in the plain text.


However, a greater problem is which languages the fonts support. In Notepad I had to accept a script font to see something similar, and that relies on font substitution.

Another complication is Unicode and whitespace ordering.

So the result from

pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt

will, at best, look like this: [image]

Thus, in summary, your sample 1 is equal to, if not better than, any other simple attempt.
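To make the ordering complication concrete: sample 1 in the question appears to have the words of each line in reverse order, as if the line had been read left to right. Below is a minimal Python sketch, assuming the extractor did nothing worse than reverse the word order within each line. The function name reverse_word_order is made up for illustration, and this is only a heuristic, not a general bidirectional-text fix.

```python
def reverse_word_order(text: str) -> str:
    """Reverse the order of whitespace-separated words on each line.

    Assumption: each line came out with its words in left-to-right
    visual order, so flipping the word order restores reading order.
    Runs of whitespace are collapsed to a single space.
    """
    fixed_lines = []
    for line in text.splitlines():
        words = line.split()
        fixed_lines.append(" ".join(reversed(words)))
    return "\n".join(fixed_lines)
```

This does not repair letters that were split out of ligatures in the wrong order, so some words may still be misspelled after the reordering.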

Later edit, from B.A.'s comment below:

I found a way around this: after extracting the text, I open the .txt file and normalize its content with the unicodedata.normalize() function from Python's unicodedata module. So I can now say that pdftotext is the best tool for Arabic text extraction.
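The comment does not say which normalization form was used, so here is a minimal sketch of that workflow assuming NFKC and a UTF-8 text file. The helper names normalize_text and normalize_file are illustrative, not part of any library.

```python
import unicodedata
from pathlib import Path


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Apply Unicode normalization to already extracted text.

    NFKC (an assumption here) replaces Arabic presentation forms,
    such as isolated or initial letter shapes and lam-alef ligatures,
    with the ordinary Arabic letters.
    """
    return unicodedata.normalize(form, text)


def normalize_file(src: str, dst: str, form: str = "NFKC") -> None:
    """Read a UTF-8 text file, normalize it, and write the result."""
    raw = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(normalize_text(raw, form), encoding="utf-8")
```

With the pdftotext output from above, this could be called as normalize_file("try.txt", "try_normalized.txt"), where the output file name is just an example.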

0
Fayez On

One can also open an Arabic PDF file in MS Word and then save it in .docx format. This approach can yield a more accurate conversion of the Arabic text from PDF to DOCX. It is recommended to use the latest version of MS Word for better compatibility and results.

0
Ahmed Ayman On

Unicode normalization should fix that issue (you can choose NFKC).

Most programming languages have a normalization function. See here for more information about normalization: https://unicode.org/reports/tr15/
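As a rough illustration of why the compatibility form NFKC is the one to choose here, the Python sketch below compares the four normalization forms on two presentation-form characters of the kind seen in sample 2. The helper names compare_forms and count_presentation_forms are made up for this example. The block ranges are the Unicode "Arabic Presentation Forms-A" and "Arabic Presentation Forms-B" blocks.

```python
import unicodedata


def compare_forms(text: str) -> dict[str, str]:
    """Return the text normalized under each of the four Unicode forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


def count_presentation_forms(text: str) -> int:
    """Count characters in the Arabic Presentation Forms-A/B blocks."""
    return sum(
        1
        for ch in text
        if 0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFF
    )


# ALEF ISOLATED FORM followed by LAM INITIAL FORM
sample = "\uFE8D\uFEDF"
forms = compare_forms(sample)

# NFC and NFD leave presentation forms untouched; only the
# compatibility forms (NFKC, NFKD) map them to plain Arabic letters.
for name, value in forms.items():
    print(name, [hex(ord(ch)) for ch in value], count_presentation_forms(value))
```

A nonzero count from count_presentation_forms on extracted text is a hint that the text would benefit from NFKC normalization.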