I am trying to scrape data from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf
Not only is the document very long, but it also has tables in different formats. I tried using the extract_tables() function in the tabulizer library. This successfully scrapes the data tables beginning on page 143 of the document but does not work for the tables on pages 18-75. Are these pages unscrapable? If so, why?
I get error messages that say "more columns than column names" and "duplicate 'row.names' are not allowed" when I run the following:
library(tabulizer)
# Let tabulizer choose the extraction method and return a list of data frames
child_support_scrape <- extract_tables(
  file = "C:/Users/Jenny/Downloads/OCSE_2018_annual_report.pdf",
  method = "decide",
  output = "data.frame"
)
Because text in PDF files is not stored in a plain-text format, it is generally hard to extract text from a PDF file. The following method provides an alternative way to extract the tables from the PDF. It requires the pdftools and plyr packages. Additional formatting may be required to tidy the resulting data frame.
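A minimal sketch of this kind of approach is shown here. It assumes that columns in the text returned by pdf_text() are separated by runs of two or more spaces, which may not hold for every table in the report. The helper name page_to_df is hypothetical and introduced only for illustration; the file path is the one from the question.

```r
library(pdftools)
library(plyr)

# Hypothetical helper: turn the text of one page into a data frame.
# Assumption: cells on a line are separated by two or more spaces.
page_to_df <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop blank lines
  cells <- strsplit(lines, " {2,}")      # split each line into cells
  # One single-row data frame per line, with columns named V1, V2, ...
  rows <- lapply(cells, function(x) {
    as.data.frame(t(x), stringsAsFactors = FALSE)
  })
  # rbind.fill pads shorter rows with NA instead of raising an error
  plyr::rbind.fill(rows)
}

pdf_path <- "C:/Users/Jenny/Downloads/OCSE_2018_annual_report.pdf"
if (file.exists(pdf_path)) {
  pages <- pdf_text(pdf_path)            # one character string per page
  tables <- lapply(pages[18:75], page_to_df)
}
```

Because rbind.fill() fills missing cells with NA, rows with differing numbers of cells no longer trigger errors such as "more columns than column names", but header rows, footnotes, and multi-line labels will still need to be cleaned up by hand.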