I am trying to read a PDF using PyPDF2. The PDF contains non-English text, in this case Hindi.
import PyPDF2

# PDF source: https://icar.org.in/sites/default/files/inline-files/DARE_Hindi_AR_2014-15-1-1.pdf
pdf_path = "/content/drive/MyDrive/DARE_Hindi_AR_2014-15-1-1.pdf"
pdf_text = ""
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Concatenate the extracted text of every page
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
When I print pdf_text, I get:
•ŸÈ‚¥œÊŸ ∞fl¥ Áfl∑§Ê‚ ‚¥’¥œË ¬˝ÿʂ٥ ‚ ’˝«-ª„Í¥ ¡ËŸÙ◊ ∑‘§ ∑˝§◊ ∑§Ê ◊‚ıŒÊ πÙ¡Ÿ ÃÕÊ •‚Ë‹ ¬ˇÊË ∑‘§ ‚ê¬Íáʸ ¡ËŸÙ◊ ∑§Ê ÁŸœÊ¸⁄UáÊ∑§⁄UŸ fl ÃËŸ SÃ⁄UËÿ ‡ÊÍ∑§⁄U ífl⁄U ≈UË∑§Ê ¬˝÷ŒÙ¥, ¬S≈U «‚ ¬ÁS≈U‚ ⁄UÙ◊¥ÕË Áfl·ÊáÊÈ (¬Ë¬Ë•Ê⁄UflË) •ı⁄U ãÿÍ ∑Ò§‚‹ ⁄UÙª Áfl·ÊáÊÈ (∞Ÿ«ËflË) ∑‘§‚¥Èª⁄UË/~{ ¬˝÷Œ ∑§Ù ÃÒÿÊ⁄U ∑§⁄UŸÊ ©ÑπŸËÿ •ŸÈ‚¥œÊŸ ©¬‹Áéœ ⁄U„˄Җ “„USÃ-ÁŸŒ¸Á‡Êà ÄU‹ÙŸË∑§⁄UáÊ” ∑‘§ ◊Êäÿ◊ ‚ ©à¬ÛÊ ÷Ò¥‚ ∑§Ë ¬˝Õ◊∑§≈U«∏Ë “‹ÊÁ‹◊Ê” •ı⁄U ◊È⁄Uʸ ÷Ò¥‚ ∑§Ë ¬⁄UËÁˇÊà ‚¥ÃÁà ∑‘§ Á„◊Ë∑Χà flËÿ¸∑§Ë ∑§ÊÁÿ∑§ ∑§ÙÁ‡Ê∑§Ê ‚ ÄU‹ÙŸË∑Χà “⁄U¡Ã” ∑§Ê ¡ã◊ ©à‚Ê„flœ¸∑§ ⁄U„Ê–ªÙ¬‡Êȕ٥ ∑§Ë Œ‚Ë ŸS‹Ù¥ ‚¥∑§Ù⁄UË, ’‹Ê„Ë •ı⁄U ◊ÁáʬÈÈ⁄UË; ªÙ¡⁄UË ÷Ò¥‚;„Á⁄UŸÉÊÊ≈UÊ ∑§Ê‹Ë ◊Ȫ˸ ÃÕÊ å‹ÄU≈˛Ê∞¥ÁÕÿÊ‚ ∞À∑§Ù∑§Ë •ı⁄U ¬ê»‘§Á⁄U‚ ‚⁄UÿÍ ¡Ò‚Ë ◊¿Á‹ÿÙ¥ ∑§ ªÈáÊÊ¥ ∑§Ê ‹ˇÊáÊ-fláʸŸ fl ‚¥⁄UˇÊáÊ ◊„àfl¬Íáʸ ©¬‹ÁéœÿÊ¥ „Ò¥–
When I try to read another file (https://www.surveyofindia.gov.in/documents/coi-hindi.pdf), I get similarly garbled text.
What is the correct way to read a non-English PDF?
I have already included my attempt in the question description.
Q) What is the correct way to read a non-English PDF?
A) It depends on the quality assurance applied to the document. In this case there was none.
Most countries have a DDA or Section 508 equivalent that requires PDFs to be computer-readable for blind users and others. This one has NOT been encoded to be readable.
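A quick way to confirm that extracted text is not real Unicode Hindi is to measure how many of its characters fall in the Devanagari block (U+0900 to U+097F). The following is only a minimal sketch: the function names, the 0.5 threshold, and the sample Hindi string are assumptions made for illustration, not part of pypdf or of the question.

```python
def devanagari_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


def looks_like_hindi(text: str, threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary illustrative cut-off, not a tuned value.
    return devanagari_ratio(text) >= threshold


# Fragment of the garbled output shown in the question.
garbled = "•ŸÈ‚¥œÊŸ ∞fl¥ Áfl∑§Ê‚"
# Hypothetical sample of properly encoded Hindi text.
real_hindi = "अनुसंधान एवं विकास"

print(looks_like_hindi(garbled))
print(looks_like_hindi(real_hindi))
```

If the check fails on every page, the text layer itself is the problem, so changing the extraction call is unlikely to help.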
Options are threefold
The Accessibility Statement is here: https://www.surveyofindia.gov.in/pages/accessibility-statement
Sadly, this does not bode well.