Reading non-English text using PyPDF2


I am trying to read a PDF using PyPDF2. The PDF contains non-English text; in this case it is Hindi.

import PyPDF2

# Source PDF: https://icar.org.in/sites/default/files/inline-files/DARE_Hindi_AR_2014-15-1-1.pdf
pdf_path = "/content/drive/MyDrive/DARE_Hindi_AR_2014-15-1-1.pdf"

pdf_text = ""
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Concatenate the extracted text of every page
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()

When I print pdf_text, I get:

•ŸÈ‚¥œÊŸ ∞fl¥ Áfl∑§Ê‚ ‚¥’¥œË ¬˝ÿʂ٥ ‚ ’˝«-ª„Í¥ ¡ËŸÙ◊ ∑‘§ ∑˝§◊ ∑§Ê ◊‚ıŒÊ πÙ¡Ÿ ÃÕÊ •‚Ë‹ ¬ˇÊË ∑‘§ ‚ê¬Íáʸ ¡ËŸÙ◊ ∑§Ê ÁŸœÊ¸⁄UáÊ∑§⁄UŸ fl ÃËŸ SÃ⁄UËÿ ‡ÊÍ∑§⁄U ífl⁄U ≈UË∑§Ê ¬˝÷ŒÙ¥, ¬S≈U «‚ ¬ÁS≈U‚ ⁄UÙ◊¥ÕË Áfl·ÊáÊÈ (¬Ë¬Ë•Ê⁄UflË) •ı⁄U ãÿÍ ∑Ò§‚‹ ⁄UÙª Áfl·ÊáÊÈ (∞Ÿ«ËflË) ∑‘§‚¥Èª⁄UË/~{ ¬˝÷Œ ∑§Ù ÃÒÿÊ⁄U ∑§⁄UŸÊ ©ÑπŸËÿ •ŸÈ‚¥œÊŸ ©¬‹Áéœ ⁄U„˄Җ “„USÃ-ÁŸŒ¸Á‡Êà ÄU‹ÙŸË∑§⁄UáÊ” ∑‘§ ◊Êäÿ◊ ‚ ©à¬ÛÊ ÷Ò¥‚ ∑§Ë ¬˝Õ◊∑§≈U«∏Ë “‹ÊÁ‹◊Ê” •ı⁄U ◊È⁄Uʸ ÷Ò¥‚ ∑§Ë ¬⁄UËÁˇÊà ‚¥ÃÁà ∑‘§ Á„◊Ë∑Χà flËÿ¸∑§Ë ∑§ÊÁÿ∑§ ∑§ÙÁ‡Ê∑§Ê ‚ ÄU‹ÙŸË∑Χà “⁄U¡Ã” ∑§Ê ¡ã◊ ©à‚Ê„flœ¸∑§ ⁄U„Ê–ªÙ¬‡Êȕ٥ ∑§Ë Œ‚Ë ŸS‹Ù¥ ‚¥∑§Ù⁄UË, ’‹Ê„Ë •ı⁄U ◊ÁáʬÈÈ⁄UË; ªÙ¡⁄UË ÷Ò¥‚;„Á⁄UŸÉÊÊ≈UÊ ∑§Ê‹Ë ◊Ȫ˸ ÃÕÊ å‹ÄU≈˛Ê∞¥ÁÕÿÊ‚ ∞À∑§Ù∑§Ë •ı⁄U ¬껑§Á⁄U‚ ‚⁄UÿÍ ¡Ò‚Ë ◊¿Á‹ÿÙ¥ ∑§ ªÈáÊÊ¥ ∑§Ê ‹ˇÊáÊ-fláʸŸ fl ‚¥⁄UˇÊáÊ ◊„àfl¬Íáʸ ©¬‹ÁéœÿÊ¥ „Ò¥–

When I try to read another file, https://www.surveyofindia.gov.in/documents/coi-hindi.pdf , I get similarly garbled text.

What is the correct way to read a non-English PDF?

I have already included my attempt in question description


There is 1 answer

K J On

Q) What is the correct way to read a non-English PDF?

A) It depends on the quality assurance applied to the document; in this case there was none. Most countries have a DDA or Section 508 equivalent that requires PDFs to be computer-readable for blind users and others. This one has NOT been encoded to be readable.
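One way to confirm that a PDF is not encoded readably is to check whether the extracted text contains any Devanagari code points at all. The following is a minimal sketch, not part of PyPDF2; the function name `devanagari_ratio` and the idea of using a ratio as a signal are assumptions made for illustration. Properly encoded Hindi text should score close to 1.0, while output like the sample in the question scores 0.0 because its letters are all Latin-range glyph codes.

```python
def devanagari_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in the Devanagari block.

    Hypothetical helper for illustration: a value near 0.0 suggests the PDF
    uses a legacy font encoding rather than real Unicode Hindi text.
    """
    # str.isalpha() keeps letters only; vowel signs and punctuation are skipped.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0900..U+097F is the Unicode Devanagari block.
    in_block = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return in_block / len(letters)


if __name__ == "__main__":
    print(devanagari_ratio("हिंदी"))      # real Unicode Hindi
    print(devanagari_ratio("•ŸÈ‚¥œÊŸ"))  # fragment of the garbled output
```

Running this on `pdf_text` before any further processing gives a quick signal for whether plain text extraction is usable or whether one of the options below is needed.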

There are three options:

  • 1 Check that you have the latest official publication, https://lddashboard.legislative.gov.in/sites/default/files/Constitution%20%20of%20India_Hindi.pdf , and if it still has the same fault, then under local "Freedom of Information" rules ask for a DDA-compatible copy of the document in question.
  • 2 Use OCR and hope to get something that sounds about right and can therefore be read by AI (Audio Interpreter). Expect roughly a 5-20% failure rate in fine details, such as wrong or reversed characters.
  • 3 Write a custom program to painstakingly remap each of up to 65,536 character codes to the correct letter. In some cases it may be enough to simply replace the font, but that rarely works. However, I have heard that some users here have had good results.
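As a rough illustration of the third option, the sketch below applies a lookup table to the extracted text, trying longer glyph sequences first so that a multi-character code is not broken up by a shorter one. Everything here is an assumption for illustration: `build_remapper` is a hypothetical helper, and `DEMO_TABLE` contains placeholder entries only. It is not the real mapping for the font used in these PDFs, which would have to be worked out by hand.

```python
import re
from typing import Callable, Dict


def build_remapper(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a function that replaces legacy glyph codes using `table`."""
    if not table:
        # Nothing to replace: return the text unchanged.
        return lambda text: text
    # Longest keys first, so "qa" is matched before its prefix "q".
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def remap(text: str) -> str:
        return pattern.sub(lambda m: table[m.group(0)], text)

    return remap


# Placeholder entries only (hypothetical), NOT a real font mapping.
DEMO_TABLE = {
    "q": "\u0915",         # stands in for a single-glyph code
    "qa": "\u0915\u093e",  # stands in for a two-glyph sequence
}

if __name__ == "__main__":
    remap = build_remapper(DEMO_TABLE)
    print(remap("qa q"))
```

A real table would need an entry for every glyph code the font uses, and a plain substitution like this does not reorder characters, so it is only a starting point.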

The Accessibility Statement is here: https://www.surveyofindia.gov.in/pages/accessibility-statement

We have put in our best efforts to ensure that all information on this website is accessible to people with disabilities. For example, a user with a visual disability can access this website using assistive technologies, such as screen readers ...

Sadly, this does not bode well.