Arabic pdf text extraction

Question

2.1k views Asked by B.A At 09 June 2022 at 11:45

I'm trying to extract text from Arabic PDFs (raw text extraction, not OCR).

I tried many packages and tools, including Python packages, PDFBox, the Adobe API, and many others, and all of them failed to extract the text correctly: either they read the text LTR or they decode it wrongly.

Here are two samples from different tools:
sample 1:

املحتويات

7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧

sample 2:

ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧

As for the original text: yes, I can copy it and get the same rendered text.

Is there any tool that can extract Arabic text correctly?

The book link can be found here

There are 3 answers

Fayez On 20 December 2023 at 07:49

One can also open an Arabic PDF file in MS Word and then save it in .docx format. This approach can yield a more accurate conversion of the Arabic text from PDF to DOCX. It is recommended to use the latest version of MS Word for better compatibility and results.

Ahmed Ayman On 03 October 2022 at 08:58

Unicode normalization should fix that issue (you can choose NFKC).

Most programming languages have a normalization function. See here for more info about normalization: https://unicode.org/reports/tr15/
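
A minimal sketch of what this looks like in Python, using the standard unicodedata module. The helper name normalize_arabic is made up for illustration, and NFKC is simply the form suggested above. Note that normalization only maps presentation-form glyphs (like those in sample 2) back to the base Arabic letters; it does not reorder the letters, so output that comes out in visual order may still need separate handling.

```python
import unicodedata


def normalize_arabic(text: str, form: str = "NFKC") -> str:
    """Map Arabic presentation forms to their base letters.

    `normalize_arabic` is a hypothetical helper name; the real work is
    done by the standard-library call unicodedata.normalize().
    """
    return unicodedata.normalize(form, text)


# U+FEFB is the isolated LAM WITH ALEF ligature (a presentation form).
# NFKC expands it into the two base letters LAM (U+0644) + ALEF (U+0627).
ligature = "\uFEFB"
expanded = normalize_arabic(ligature)
print([hex(ord(ch)) for ch in expanded])
```

If the extracted text already uses base letters (as in sample 1), this call leaves those letters unchanged.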

**K J** · Accepted Answer · 2022-06-09T14:59:21+00:00

The text in a PDF is not the same as the text used for its construction. We can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.

However, a greater problem is the languages supported by fonts: in Notepad I had to accept a script font to see a similarity, but that is using a font substitution.

Another complication is Unicode and whitespace ordering.

So the result from

pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt

at best will look like

Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.

Later edit, from B.A.'s comment below

I found a way around this: after extracting the text, I open the txt file and normalize its content using the Python unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
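
A rough sketch of that workflow, for anyone who wants to script it. The function names extract_page and normalize_file are invented here, the comment does not say which normalization form was used (NFKC is assumed, following the other answer), and the code assumes pdftotext is on the PATH and writes UTF-8, which is its usual default.

```python
import subprocess
import unicodedata
from pathlib import Path


def extract_page(pdf_path: str, txt_path: str, page: int) -> None:
    """Run pdftotext on a single page, with the same -f/-l flags as above.

    Assumes the pdftotext binary is installed and on the PATH.
    """
    subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, txt_path],
        check=True,  # raise CalledProcessError if pdftotext fails
    )


def normalize_file(txt_path: str, form: str = "NFKC") -> str:
    """Normalize the extracted text file in place and return the new text.

    NFKC is an assumption; the original comment names only
    unicodedata.normalize(), not the form.
    """
    path = Path(txt_path)
    text = path.read_text(encoding="utf-8")
    normalized = unicodedata.normalize(form, text)
    path.write_text(normalized, encoding="utf-8")
    return normalized
```

Usage would be something like extract_page("في_الأدب_الجاهلي.pdf", "try.txt", 5) followed by normalize_file("try.txt").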
