I'm trying to extract text from a PDF file using pdfplumber:
import pdfplumber
pdf = pdfplumber.open(r"https://www.lupin.com/pdf/financials/subsidiaries/multicare-pharmaceuticals-philippines-inc-philippines-2018.pdf")
for ps in pdf.pages:
    print(ps.extract_text(y_tolerance=20))
However, the output I get is shown below. It does not reproduce the text as it appears in the PDF, and it returns jumbled characters and symbols:
(A WhLolUlyI POnwdMIenNapFerid ecnP nhSad Hun3ebc1Insi,aiLt d 2laAiI a0SnuPr1tdyda8P itot eIafo mnLNrusde’pE n2iRnSt0es H1, p o7oIl drNitn Cgs., B.V.)
1135 Chino Roces Avenue , Makati City, Philippines None None None (SLATU AWPTIhENoMll yPE NOHwTISLn eIOdPF PS FuIINbNsEiAdSiNa, rCIyIN AoCfL L. PuOpiSnI HToIOldNings, B.V.) PDCTDIPRNACnrrreaueuSooeavfsferSnppudeTTehr Enae-enroofercryr tTdttnoeaotaaumaSymdntrllrb i edrCNeAa tlneer asnuosoe tnxdrnsstslr ath e-eaee tcecAtsnequurssdtsu rre r rsiiAetptpeesyscamnts ertdseeit veiAtnesapstsbo–slesneitetssst–net No21129678t2201es PP 434 125863 21674007458 0,,,,,,,,,, 2367029366 1 26653629047825231406078 ,,,,,,,,,,, 43521581332 01450564793 M07073790873arc h 3 1 P P 323 313732 51359221260 ,,,,,,,,,,1 1685100499 72767656537925485539399,,,,,,,,,,,446553288940724576365308101010455 SDSDTDLCENerIhqueeuoaeAaffuern dTeiNrrBc i-eroetetoiIcrot yt tneaLcau erdantsIlre pdTrCtLltoaei aIuito atEnaxFretiblr Sthndle iiea L alnApnirbitt caaN piiiLlerbaiDatitlsaii yy elSbEiatstiabQylitlteeUimesIseTnYts 11211214 PP 444( 134362 7907555 ,,,,,,, 146490647732044012981676,,,,,,,,968737381184639575050575) PP 332 ( 341039 5473266 ,,,,,,, 6692396 2801199 1417594 ,,,,,,, 2033441-32631380406151)
The text comes out jumbled because the value of y_tolerance is too high. When extracting text, pdfplumber adds a newline character wherever the difference between the doctop of one character and the doctop of the next is greater than y_tolerance. With y_tolerance=20, characters from several neighbouring lines are treated as a single line, so their letters end up interleaved. To resolve this, remove the parameter and call ps.extract_text() so that the default tolerance is used.

The weird symbols come from the barcode at the bottom right and from the bullet points in the PDF. If you try to copy the barcode on any page, you'll get odd emoji-like Unicode characters. If you don't want them, you can simply replace them.
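The two suggestions above could be combined roughly as follows. This is only a sketch: it assumes the PDF has been saved locally, and the helper names (strip_symbols, extract_clean_text), the file name in the usage comment, and the choice of which characters count as unwanted are illustrative assumptions, not something taken from the PDF itself.

```python
import unicodedata

# Assumption: the unwanted glyphs fall into these Unicode categories
# (So = other symbols such as emoji, Co = private use, Cn = unassigned).
UNWANTED_CATEGORIES = {"So", "Co", "Cn"}

# Assumption: the bullet to drop is U+2022. It is classed as punctuation (Po),
# so the category filter alone would keep it.
UNWANTED_CHARS = {"\u2022"}


def strip_symbols(text):
    """Return text without the characters assumed to be barcode or bullet debris."""
    return "".join(
        ch
        for ch in text
        if ch not in UNWANTED_CHARS
        and unicodedata.category(ch) not in UNWANTED_CATEGORIES
    )


def extract_clean_text(path):
    """Extract each page's text with the default y_tolerance, then clean it."""
    # Imported here so strip_symbols can be used without pdfplumber installed.
    import pdfplumber

    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # No y_tolerance argument: the library default is used.
            text = page.extract_text() or ""
            pages.append(strip_symbols(text))
    return pages


# Usage (hypothetical local file name):
# for page_text in extract_clean_text("report.pdf"):
#     print(page_text)
```

If the filter removes something you need, or leaves a symbol behind, adjust the two sets at the top after checking which characters actually appear in your own output.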