trying to scrape from long PDF with different table formats

Question

Asked by Jennifer B. on 29 April 2021 at 19:03

I am trying to scrape from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf

Not only is the document very long, but its tables also come in different formats. I tried the extract_tables() function from the tabulizer package. It successfully scrapes the data tables beginning on page 143 of the document, but it does not work for the tables on pages 18-75. Are these pages unscrapable? If so, why?

I get error messages that say "more columns than column names" and "duplicate 'row.names' are not allowed". This is the call I am using:

library(tabulizer)

child_support_scrape <- extract_tables(
  file   = "C:/Users/Jenny/Downloads/OCSE_2018_annual_report.pdf",
  method = "decide",
  output = "data.frame")

1 Answer

**s20012303** · Accepted Answer · 2021-04-29T19:46:26+00:00

Text in a PDF file is not stored as plain text, so it is generally hard to extract tables from one. The following is an alternative way to extract a table from the PDF: read each page as raw text, then split the lines into cells. It requires the pdftools and plyr packages.

# Read the PDF; pdf_text() returns a character vector with one element per page
pages <- pdftools::pdf_text("https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf")

# Focus on the table on page 22 and split the page into lines
page22_lines <- strsplit(pages[[22]], "\n")[[1]]

# Split each line into cells wherever a run of 2 to 100 spaces occurs
page22_cells <- strsplit(page22_lines, " {2,100}")

# Turn each line into a one-row data frame and stack them;
# rbind.fill() pads rows that have fewer cells with NA
page22_table <- plyr::rbind.fill(lapply(page22_cells, function(x) as.data.frame(t(matrix(x)))))
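To see what the split on runs of spaces does without downloading the report, here is a minimal sketch on a few invented lines (the sample values are made up and do not come from the PDF). It uses base R padding instead of plyr, and trims leading whitespace first, which is an extra step not in the code above.

```r
# Invented sample lines laid out the way pdf_text() returns a table page
sample_lines <- c(
  "STATE             2017      2018",
  "Alabama         12,345    13,000",
  "New Mexico       9,876",
  "North Dakota     1,234     1,300"
)

# Split on runs of two or more spaces; single spaces inside names are kept
cells <- strsplit(trimws(sample_lines), " {2,}")

# Pad every row with NA up to the width of the widest row
n_cols <- max(lengths(cells))
padded <- lapply(cells, function(x) c(x, rep(NA_character_, n_cols - length(x))))

# Bind the padded rows into a character data frame (columns V1, V2, V3)
toy_table <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(toy_table)
```

Because the split needs at least two spaces, "New Mexico" stays in one cell, while the row with a missing value simply ends up with NA in its last column.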

Additional cleanup may be required to tidy the resulting data frame, for example setting proper column names and converting the number columns from text to numeric.
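As one possible cleanup, the sketch below assumes that the first extracted row holds the column headers and that numbers use commas as thousands separators. Both are assumptions about the table, and the `raw` data frame is an invented stand-in for the extracted result, so adjust it to what a given page actually produces.

```r
# Invented stand-in for an extracted table: every column is character
raw <- data.frame(
  V1 = c("STATE", "Alabama", "New Mexico"),
  V2 = c("2017", "12,345", "9,876"),
  V3 = c("2018", "13,000", NA),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop that row
clean <- raw[-1, , drop = FALSE]
names(clean) <- unlist(raw[1, ], use.names = FALSE)
rownames(clean) <- NULL

# Remove thousands separators and convert all but the first column to numeric
clean[-1] <- lapply(clean[-1], function(x) as.numeric(gsub(",", "", x)))
str(clean)
```

Pages with multi-line headers or footnote rows would need those rows removed before this step.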
