Extract text correctly from a two-column PDF in R


I am trying to extract the text of companies' annual reports. Most of them are laid out in two columns, and I don't know how to extract the text correctly: with the pdftools package in R, the first line of the first column comes out next to the first line of the second column, instead of being followed by the second line of the first column.

This is my code:

library(pdftools)
# pdf_text() returns one string per page; lines are built across the full page width
readpdf <- pdf_text("https://www.telefonica.com/documents/153952/13347920/2019-Telefonica-Consolidated-Management-Report.pdf/0a9c8382-c9ff-ba52-1d5b-e431a7efab3f")

How can I do this correctly?


There is 1 answer

Max Volpi On

My answer would be to use something like ABBYY FineReader or equivalent OCR software. I have tried the open-source tools available in R on the same sort of data, but they did not work well enough for my purposes.
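For reference, here is a minimal sketch of the kind of open-source approach one might try first. It relies on pdftools::pdf_data(), which returns one data frame per page with a row per word and its x/y position, rather than pdf_text(). The helper name read_two_columns and the split_x parameter are hypothetical, and the sketch assumes a clean two-column page with a single vertical gutter and words on the same line sharing the same y value. Real annual reports with tables, full-width headings, or sidebars will likely break these assumptions.

```r
# Reorder the words of one page into reading order for a two-column layout.
# `words`: a data frame shaped like one element of pdftools::pdf_data(),
#          with columns x, y (word position in points) and text.
# `split_x`: assumed horizontal position of the gutter between the columns.
read_two_columns <- function(words, split_x) {
  read_column <- function(col) {
    if (nrow(col) == 0) return(character(0))
    # Sort top to bottom, then left to right within a line.
    col <- col[order(col$y, col$x), ]
    # Assumption: words on the same line have exactly the same y value.
    lines <- tapply(col$text, col$y, paste, collapse = " ")
    unname(as.character(lines))
  }
  left  <- words[words$x <  split_x, ]
  right <- words[words$x >= split_x, ]
  # All lines of the left column first, then all lines of the right column.
  c(read_column(left), read_column(right))
}

# Possible usage (not run here; "report.pdf" is a placeholder path):
# library(pdftools)
# pages <- pdf_data("report.pdf")
# p <- pages[[1]]
# split_x <- (min(p$x) + max(p$x + p$width)) / 2  # rough guess at the gutter
# page_lines <- read_two_columns(p, split_x)
```

Guessing the gutter as the midpoint of the text area is only a rough heuristic; pages that mix one-column and two-column sections would need to be detected and handled separately, which may be where OCR or layout-aware software does better.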