Reading non-English text using PyPDF2

Question


50 views · Asked by datagekk on 29 September 2023 at 11:07

I am trying to read a PDF using PyPDF2. The PDF contains non-English text; in this case it is Hindi.

import PyPDF2

# Source PDF: https://icar.org.in/sites/default/files/inline-files/DARE_Hindi_AR_2014-15-1-1.pdf
with open("/content/drive/MyDrive/DARE_Hindi_AR_2014-15-1-1.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Concatenate the extracted text of every page.
    pdf_text = ""
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()

When I print pdf_text, I get:

•ŸÈ‚¥œÊŸ ∞ﬂ¥ Áﬂ∑§Ê‚ ‚¥’¥œË ¬˝ÿÊ‚Ù¥ ‚ ’˝«-ª„Í¥ ¡ËŸÙ◊ ∑‘§ ∑˝§◊ ∑§Ê ◊‚ıŒÊ πÙ¡Ÿ ÃÕÊ •‚Ë‹ ¬ˇÊË ∑‘§ ‚ê¬ÍáÊ¸ ¡ËŸÙ◊ ∑§Ê ÁŸœÊ¸⁄UáÊ∑§⁄UŸ ﬂ ÃËŸ SÃ⁄UËÿ ‡ÊÍ∑§⁄U íﬂ⁄U ≈UË∑§Ê ¬˝÷ŒÙ¥, ¬S≈U «‚ ¬ÁS≈U‚ ⁄UÙ◊¥ÕË Áﬂ·ÊáÊÈ (¬Ë¬Ë•Ê⁄UﬂË) •ı⁄U ãÿÍ ∑Ò§‚‹ ⁄UÙª Áﬂ·ÊáÊÈ (∞Ÿ«ËﬂË) ∑‘§‚¥Èª⁄UË/~{ ¬˝÷Œ ∑§Ù ÃÒÿÊ⁄U ∑§⁄UŸÊ ©ÑπŸËÿ •ŸÈ‚¥œÊŸ ©¬‹Áéœ ⁄U„Ë„Ò– “„USÃ-ÁŸŒ¸Á‡ÊÃ ÄU‹ÙŸË∑§⁄UáÊ” ∑‘§ ◊Êäÿ◊ ‚ ©à¬ÛÊ ÷Ò¥‚ ∑§Ë ¬˝Õ◊∑§≈U«∏Ë “‹ÊÁ‹◊Ê” •ı⁄U ◊È⁄UÊ¸ ÷Ò¥‚ ∑§Ë ¬⁄UËÁˇÊÃ ‚¥ÃÁÃ ∑‘§ Á„◊Ë∑Î§Ã ﬂËÿ¸∑§Ë ∑§ÊÁÿ∑§ ∑§ÙÁ‡Ê∑§Ê ‚ ÄU‹ÙŸË∑Î§Ã “⁄U¡Ã” ∑§Ê ¡ã◊ ©à‚Ê„ﬂœ¸∑§ ⁄U„Ê–ªÙ¬‡ÊÈ•Ù¥ ∑§Ë Œ‚Ë ŸS‹Ù¥ ‚¥∑§Ù⁄UË, ’‹Ê„Ë •ı⁄U ◊ÁáÊ¬ÈÈ⁄UË; ªÙ¡⁄UË ÷Ò¥‚;„Á⁄UŸÉÊÊ≈UÊ ∑§Ê‹Ë ◊ÈªË¸ ÃÕÊ å‹ÄU≈˛Ê∞¥ÁÕÿÊ‚ ∞À∑§Ù∑§Ë •ı⁄U ¬ê»‘§Á⁄U‚ ‚⁄UÿÍ ¡Ò‚Ë ◊¿Á‹ÿÙ¥ ∑§ ªÈáÊÊ¥ ∑§Ê ‹ˇÊáÊ-ﬂáÊ¸Ÿ ﬂ ‚¥⁄UˇÊáÊ ◊„àﬂ¬ÍáÊ¸ ©¬‹ÁéœÿÊ¥ „Ò¥–

When I try to read another file, https://www.surveyofindia.gov.in/documents/coi-hindi.pdf, I get text similar to the above.

What is the correct way to read a non-English PDF?

I have already included my attempt in the question description.

Original Q&A

There is 1 answer

**K J** · Answer 1 · 2023-09-29T12:21:39+00:00

Q) What is the correct way to read a non-English PDF?

A) It depends on the quality assurance applied to the document; in this case there was none. Most countries have a DDA or Section 508 equivalent that requires PDFs to be computer-readable for blind users and others who rely on assistive technology. This one has NOT been encoded to be readable, so extraction returns font-specific character codes instead of Hindi text.
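A quick way to confirm that a file has this problem is to measure how much of the extracted text actually falls in the Unicode Devanagari block (U+0900 to U+097F). The sketch below is only an illustration: the function names and the 0.5 threshold are my own assumptions, not part of PyPDF2.

```python
def devanagari_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_like_hindi(text: str, threshold: float = 0.5) -> bool:
    """Heuristic check; the 0.5 threshold is an arbitrary assumption."""
    return devanagari_ratio(text) >= threshold


if __name__ == "__main__":
    garbled = "\u2022\u0178\u00c8\u201a\u00a5\u0153\u00ca\u0178"  # first word of the output in the question
    print(looks_like_hindi(garbled))
```

If the ratio is close to zero for a page that visibly contains Hindi, the text layer is not usable as-is and one of the options below is needed.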

There are three options:

1 Check that you have the latest official publication, for example https://lddashboard.legislative.gov.in/sites/default/files/Constitution%20%20of%20India_Hindi.pdf. If that copy is still at fault, then under local "Freedom of Information" rules, ask for a DDA-compatible copy of the document in question.
2 Use OCR and hope to get something that sounds about right and can therefore be read by AI (Audio Interpreter). Expect roughly a 5 - 20% failure rate in fine details, such as wrong or reversed characters.
3 Write a custom program to painstakingly remap each of the 65,536 possible characters to the correct one. In some cases this may be as simple as replacing the font, but that rarely works. However, I have heard that some users here have had good results.
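Options 2 and 3 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe for these two files. For OCR it assumes the third-party packages pdf2image (which needs Poppler) and pytesseract (which needs Tesseract with the Hindi `hin` language data) are installed. For remapping, `GUESSED_MAP` is a hypothetical, partial table that I guessed from the first word of the garbled output; it is not a verified mapping for the font in this PDF.

```python
# Hypothetical, partial mapping guessed from the first garbled word in the
# question; NOT a verified table for the font used in this PDF.
GUESSED_MAP = {
    "\u2022": "\u0905",  # bullet               -> DEVANAGARI LETTER A
    "\u0178": "\u0928",  # Y with diaeresis     -> DEVANAGARI LETTER NA
    "\u00c8": "\u0941",  # E with grave         -> DEVANAGARI VOWEL SIGN U
    "\u201a": "\u0938",  # low-9 quotation mark -> DEVANAGARI LETTER SA
    "\u00a5": "\u0902",  # yen sign             -> DEVANAGARI SIGN ANUSVARA
    "\u0153": "\u0927",  # oe ligature          -> DEVANAGARI LETTER DHA
    "\u00ca": "\u093e",  # E with circumflex    -> DEVANAGARI VOWEL SIGN AA
}


def remap_legacy_text(text, mapping=GUESSED_MAP):
    """Option 3: replace each font-specific character using a lookup table."""
    return text.translate(str.maketrans(mapping))


def ocr_pdf(path, lang="hin", dpi=300):
    """Option 2: render each page to an image and run Tesseract OCR on it."""
    # Imported here so the remapping helper works without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(img, lang=lang) for img in pages)


if __name__ == "__main__":
    # First word of the garbled output, written with escapes for safety.
    print(remap_legacy_text("\u2022\u0178\u00c8\u201a\u00a5\u0153\u00ca\u0178"))
```

A plain lookup table like this only handles one-to-one substitutions. Legacy Hindi fonts commonly place some vowel signs before the consonant, whereas Unicode stores them after it, so a real converter also needs reordering rules. That is part of why simple replacement rarely works.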

The Accessibility Statement is here: https://www.surveyofindia.gov.in/pages/accessibility-statement

We have put in our best efforts to ensure that all information on this website is accessible to people with disabilities. For example, a user with a visual disability can access this website using assistive technologies, such as screen readers ...

Sadly, this does not bode well.
