PDF to Text for Urdu and Arabic using Ghostscript


I have a few PDF files which are in the Urdu language, and some of the PDF files are in the Arabic language.

I want to convert the PDF files to text format. I have issued the following Ghostscript code from the command line in my Windows 10 system:

gswin64c.exe -sDEVICE=txtwrite -o output.txt new.pdf

The text file is generated; however, the contents of the text file are not in Urdu or Arabic.

This is what it looks like (I have pasted a small portion of the output, as the full file is huge):

ی첺جⰧ�� ہ셈ے

How can I properly convert PDF to text using Ghostscript?

1 answer

Answered by KenS (accepted answer):

Well basically the answer is that the PDF files you have supplied have 'not terribly good' ToUnicode CMap tables.

Looking at your first file we see that it uses one font:

26 0 obj
<<
  /BaseFont /CCJSWK+JameelNooriNastaleeq
  /DescendantFonts 28 0 R
  /Encoding /Identity-H
  /Subtype /Type0
  /ToUnicode 29 0 R
  /Type /Font
>>
endobj

That font has a ToUnicode CMap in object 29; a ToUnicode CMap is meant to map character codes to Unicode code points. Looking at the first piece of text as an example, we see:

/C2_0 1 Tf
13 0 0 13 39.1302 561.97 Tm
<0003>Tj
/Span<</ActualText<FEFF0645062A>>> BDC 
<38560707>Tj

So that's character code 0x0003 (notice there is no marked content for the first character). Looking it up in the ToUnicode CMap we see:

<0003> <0020>

So character code 0x0003 maps to Unicode code point U+0020, a space. The next two character codes are 0x3856 and 0x0707. Again consulting the ToUnicode CMap we see:

<3856> <062A0645>

So that single character code maps to two Unicode code points, U+062A and U+0645, which are 'Teh' ت and 'Meem' م.

So far so good. The next code is 0x0707; when we look that up in the ToUnicode CMap it comes out as 0xFFFD, which is the 'replacement character' �. Obviously that's meaningless.
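The lookup the text extractor performs can be sketched in a few lines of Python. The dictionary below is a hypothetical, hand-reduced excerpt of the first file's ToUnicode CMap, containing just the three codes discussed:

```python
# Hypothetical excerpt of the first file's ToUnicode CMap as a dict,
# reduced to the character codes discussed above.
tounicode = {
    0x0003: '\u0020',        # space
    0x3856: '\u062A\u0645',  # one code -> two code points ('Teh' + 'Meem')
    # 0x0707 has no usable entry, even though the page draws 'Lam' with it
}

def decode(codes):
    # Text extractors fall back to U+FFFD, the replacement character,
    # when a character code has no mapping in the CMap.
    return ''.join(tounicode.get(code, '\ufffd') for code in codes)

print(decode([0x0003, 0x3856, 0x0707]))
```

A correct table would map every code the page uses; here the last code comes out as U+FFFD, which is exactly the garbage the question shows.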

We then have this:

0.391 0 Td
[<011C07071FEE>1 <0003>243.8 <2E93>]TJ
/Span<</ActualText<FEFF0644>>> BDC 
<0707>Tj
EMC 

So that's character codes 0x011C, 0x0707, 0x1FEE, 0x0003, 0x2E93, followed by 0x0707. Notice that the final <0707> is associated with a marked content definition which says the ActualText is Unicode U+0644, the 'Lam' glyph ل.

So clearly the ToUnicode CMap should be mapping 0707 to U+0644, and it doesn't.
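Those /ActualText values, like <FEFF0644> above, are hex strings holding UTF-16BE text with a leading byte-order mark. A minimal decoder (a sketch, not anything Ghostscript itself uses):

```python
def decode_actualtext(hex_str):
    """Decode a PDF ActualText hex string: UTF-16BE preceded by a BOM."""
    raw = bytes.fromhex(hex_str)
    # Decode as UTF-16BE, then drop the U+FEFF byte-order mark.
    return raw.decode('utf-16-be').lstrip('\ufeff')

print(decode_actualtext('FEFF0644'))      # 'Lam' from the second Span
print(decode_actualtext('FEFF0645062A'))  # U+0645 U+062A from the first Span
```

This is why the ActualText entries are trustworthy here even though the ToUnicode CMap is not: they spell out the intended Unicode directly.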

Now, when given a ToUnicode CMap, the text extraction code trusts it, because such tables are normally reliable. So your problem with this file is that the ToUnicode CMap is 'wrong', and that's why the text is coming out incorrect.

I haven't tried to debug further through the file, it is possible there are other errors.

Your second file has this ToUnicode CMap:

26 0 obj
<<
  /Length 606
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AABACF+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /AABACF+TT1+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0707> endcodespacerange
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

The first text in the file is:

<3718>Tj

And again, that's not in the CMap. Because the text extraction code prioritises the CMap (it's usually reliable), the missing entries cause the extraction to basically fail.

In addition to the fact that the ToUnicode CMaps are incorrect, the embedded fonts are subset and use an Identity-H CMap for drawing. That eliminates another source of information we could use.

Fundamentally, the only way you're going to get text out of that PDF file is manual transcription or OCR software.

Since you are using Ghostscript on Windows, the distributed binary includes Tesseract, so you could try using that with pdfwrite and an Urdu training file to produce a PDF file with a (possibly) better ToUnicode CMap. You could then extract the text from that PDF file.

You would have to tell the pdfwrite device not to use the embedded ToUnicode CMaps; see the UseOCR switch documented here: https://ghostscript.com/doc/9.56.1/VectorDevices.htm#PDF

Information on setting up the OCR engine and getting output is here: https://ghostscript.com/doc/9.56.1/Devices.htm#OCR-Devices

You may get better results by using an 'image' OCR output and then using the text extraction on that file to get the text out.
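Putting that together, the two routes might look like this from the Windows command line. This is a sketch, not a tested recipe: it assumes a Ghostscript build with Tesseract included and that the Urdu traineddata file is somewhere Tesseract can find it (e.g. via TESSDATA_PREFIX); check the two documentation pages above for the exact switches your version supports.

```shell
rem Route 1: rewrite the PDF, replacing the embedded ToUnicode CMaps
rem with OCR-derived ones, then extract text from the new file.
gswin64c.exe -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage=urd -o ocr.pdf new.pdf
gswin64c.exe -sDEVICE=txtwrite -o output.txt ocr.pdf

rem Route 2: render the pages to images and OCR them directly to UTF-8 text.
gswin64c.exe -sDEVICE=ocr -r300 -sOCRLanguage=urd -o output.txt new.pdf
```

Route 2 sidesteps the broken CMaps entirely, at the cost of depending wholly on OCR quality.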