PDFPlumber returning symbols and inaccurate text


I'm trying to extract text from a PDF file using pdfplumber:

import pdfplumber
pdf = pdfplumber.open(r"https://www.lupin.com/pdf/financials/subsidiaries/multicare-pharmaceuticals-philippines-inc-philippines-2018.pdf")
for ps in pdf.pages:
    print(ps.extract_text(y_tolerance=20))

However, the output I get is shown below. The text is garbled and contains stray symbols:

 (A WhLolUlyI POnwdMIenNapFerid ecnP nhSad Hun3ebc1Insi,aiLt d 2laAiI a0SnuPr1tdyda8P  itot eIafo mnLNrusde’pE  n2iRnSt0es H1, p o7oIl drNitn Cgs.,  B.V.) 
                    1135 Chino Roces Avenue    , Makati City, Philippines  None None None (SLATU AWPTIhENoMll yPE NOHwTISLn eIOdPF PS FuIINbNsEiAdSiNa, rCIyIN AoCfL L. PuOpiSnI HToIOldNings, B.V.) PDCTDIPRNACnrrreaueuSooeavfsferSnppudeTTehr Enae-enroofercryr tTdttnoeaotaaumaSymdntrllrb    i edrCNeAa tlneer asnuosoe tnxdrnsstslr ath e-eaee tcecAtsnequurssdtsu rre r rsiiAetptpeesyscamnts ertdseeit veiAtnesapstsbo–slesneitetssst–net No21129678t2201es                   PP                  434                 125863           21674007458  0,,,,,,,,,, 2367029366 1 26653629047825231406078 ,,,,,,,,,,, 43521581332  01450564793 M07073790873arc                  h            3                  1         P         P           323                 313732           51359221260  ,,,,,,,,,,1 1685100499 72767656537925485539399,,,,,,,,,,,446553288940724576365308101010455 SDSDTDLCENerIhqueeuoaeAaffuern dTeiNrrBc i-eroetetoiIcrot yt tneaLcau erdantsIlre    pdTrCtLltoaei aIuito atEnaxFretiblr Sthndle iiea L alnApnirbitt caaN piiiLlerbaiDatitlsaii yy elSbEiatstiabQylitlteeUimesIseTnYts  11211214                          PP            444(        134362    7907555  ,,,,,,, 146490647732044012981676,,,,,,,,968737381184639575050575)              PP      332     (    341039    5473266  ,,,,,,, 6692396 2801199  1417594  ,,,,,,, 2033441-32631380406151) 

There is 1 answer

Samkit Jain

The text is garbled because y_tolerance is set too high. When extracting text, pdfplumber inserts a newline wherever the difference between the doctop of one character and the doctop of the next is greater than y_tolerance. With y_tolerance=20, characters from several adjacent rows are treated as a single line, so the rows get interleaved. To resolve this, drop the parameter so the default (3) is used:

for ps in pdf.pages:        
    print(ps.extract_text())
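
To see why a large y_tolerance scrambles the text, here is a minimal sketch of tolerance-based line grouping. It is a simplified model written for illustration, not pdfplumber's actual implementation; the helpers `group_into_lines` and `make_row` and the character dictionaries are hypothetical.

```python
# Simplified model of tolerance-based line grouping.
# This is an illustration only, NOT pdfplumber's real code.

def group_into_lines(chars, y_tolerance):
    """Start a new line whenever the vertical gap to the previous
    character exceeds y_tolerance, then read each line left to right."""
    lines = []
    last_top = None
    for ch in sorted(chars, key=lambda c: c["top"]):
        if last_top is None or ch["top"] - last_top > y_tolerance:
            lines.append([])
        lines[-1].append(ch)
        last_top = ch["top"]
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]


def make_row(text, top, x_start=0.0, width=6.0):
    """Hypothetical helper: lay out a string as evenly spaced characters."""
    return [
        {"text": t, "x0": x_start + i * width, "top": top}
        for i, t in enumerate(text)
    ]


# Two rows of text, 12 points apart vertically.
chars = make_row("ABC", top=100) + make_row("xyz", top=112, x_start=1.0)

print(group_into_lines(chars, y_tolerance=3))   # ['ABC', 'xyz']
print(group_into_lines(chars, y_tolerance=20))  # ['AxByCz']
```

With the small tolerance the two rows stay separate. With a tolerance of 20 they collapse into one line and are ordered purely by horizontal position, which matches the kind of interleaving seen in the question's output.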

The stray symbols come from the barcode at the bottom right and from the bullet points in the PDF. If you copy the barcode on any page, you get odd emoji-like Unicode characters. If you don't want them, you can simply replace them after extraction.
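
One possible way to do the replacement is to filter by Unicode category. The sketch below uses only the standard library. The choice of categories is an assumption about which glyphs are unwanted in this particular PDF, and `strip_symbols` is a hypothetical helper name, so adjust both to your document.

```python
import unicodedata

# Assumption: the unwanted glyphs fall in these Unicode categories.
# Co = private use (often symbol-font bullets), Cn = unassigned,
# So = "other symbol", which covers most emoji.
UNWANTED_CATEGORIES = {"Co", "Cn", "So"}


def strip_symbols(text, replacement=""):
    """Replace every character in an unwanted category with `replacement`."""
    return "".join(
        replacement if unicodedata.category(ch) in UNWANTED_CATEGORIES else ch
        for ch in text
    )


# Sample string with a private-use bullet and an emoji.
sample = "Total assets \uf0b7 1,234 \U0001F600"
print(strip_symbols(sample))
```

The function could be applied to the string returned by `ps.extract_text()` for each page. Ordinary punctuation, digits and letters are left untouched because they belong to other categories.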