Scraping large and complex PDF tables


I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity.

I need to scrape many tables that span many pages. In some cases the rows continue onto the next page, and additional columns for the same table appear on later pages as well. I've included a link to one example. I'm comfortable with R, but I can also use Python if that would be better for scraping. I haven't found many resources, for either language, on how to deal with tables that carry over onto additional pages. I need to get these tables into CSV or XLSX format.

Thank you in advance!

In this example, Pages 15-28 should be one table. https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf


There is 1 answer

G5W

I was able to get the entire table using the following procedure.

  1. Open the PDF in MS Word, not Adobe Acrobat. Word will convert the document.
  2. After the conversion has completed, select all and copy. (Both the conversion and the selection may take some time.)
  3. Paste into a blank Excel worksheet. Save and enjoy.
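
If you would rather script the last step in Python, here is a minimal sketch of how page fragments could be stitched back into one table. It is not part of the procedure above. It assumes each page's table has already been extracted as a list of rows with the header first (for example by a PDF table extraction library), and that every row carries a unique key column. The column names and values below, including `Precinct`, are hypothetical and not taken from the linked PDF.

```python
import csv
import io


def stitch_pages(pages, key="Precinct"):
    """Merge per-page tables (header row first) into one wide table.

    Rows that continue on a later page are added under their key value;
    columns that first appear on a later page are appended and joined on `key`.
    Assumes `key` is present on every page and unique per logical row.
    """
    header = []  # combined column order, in order of first appearance
    rows = {}    # key value -> {column name: cell value}
    for page in pages:
        page_header, *body = page
        for col in page_header:
            if col not in header:
                header.append(col)
        for raw in body:
            record = dict(zip(page_header, raw))
            rows.setdefault(record[key], {}).update(record)
    # Missing cells are filled with an empty string.
    return [header] + [[r.get(c, "") for c in header] for r in rows.values()]


def to_csv(table):
    """Serialize a list-of-rows table to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


# Hypothetical fragments: pages 1-2 share columns (rows continue),
# pages 3-4 hold additional columns for the same rows.
pages = [
    [["Precinct", "Registered", "Ballots"], ["101", "500", "410"], ["102", "300", "250"]],
    [["Precinct", "Registered", "Ballots"], ["103", "450", "390"]],
    [["Precinct", "Candidate A", "Candidate B"], ["101", "200", "210"], ["102", "130", "120"]],
    [["Precinct", "Candidate A", "Candidate B"], ["103", "190", "200"]],
]

table = stitch_pages(pages)
csv_text = to_csv(table)
print(csv_text)
```

Joining on a key column rather than on row position means a page that repeats its header, or starts mid-table, does not shift the data. The real document may need extra cleanup (subtotal rows, multi-line headers) before this works.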