Trouble with the tabulizer library in R recognizing non-alphanumeric (symbol) characters in a table in a PDF

206 views Asked by user11036517 At 10 December 2019 at 01:38

I am using the tabulizer library in R to capture data from a table inside a PDF on a public website (https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).

The example table I am interested in is on page 23 of the PDF (printed page 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard format and contains various symbols (non-alphanumeric characters) in its cells. I want to extract most, if not all, of the tables from this document.

I want to end up with a table in which each symbol is replaced by a code (e.g., black circles with 999, white circles with 777, plus signs with -99).
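To make the target concrete, here is a minimal sketch of the recoding step I have in mind, applied to one extracted matrix. The function name `recode_symbols` and the lookup table are hypothetical, and the code points U+25CF and U+25CB are only my assumption about what the black and white circles would look like after extraction; the strings tabulizer actually returns may differ.

```r
# Hypothetical lookup: extracted string -> replacement code (kept as text,
# because the extracted matrix is a character matrix).
# ASSUMPTION: black circle = U+25CF, white circle = U+25CB.
symbol_codes <- setNames(
  c("999", "777", "-99"),
  c("\u25CF", "\u25CB", "+")
)

# Replace every cell that consists solely of a known symbol with its code.
recode_symbols <- function(m, codes = symbol_codes) {
  out <- m
  out[] <- trimws(m)                      # drop stray whitespace, keep dims
  hit <- !is.na(out) & out %in% names(codes)
  out[hit] <- unname(codes[out[hit]])     # look up codes by cell content
  out
}
```

Cells that do not match any entry in the lookup (station names, headers, and so on) are left unchanged apart from the whitespace trimming.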

For the most part, tabulizer does a good job of converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems with the white circles in the REC1 column. This is odd, since it does seem to recognize exotic characters in other columns.
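One way to see exactly what comes out for each symbol is to list the distinct non-alphanumeric characters in the extracted matrix together with their Unicode code points. This is only a diagnostic sketch in base R; `list_symbols` is a hypothetical helper name, not part of tabulizer.

```r
# List each distinct non-alphanumeric, non-space character found in a
# character matrix, with its Unicode code point.
list_symbols <- function(m) {
  chars <- unique(unlist(strsplit(as.vector(m), "")))
  chars <- chars[!is.na(chars) & nzchar(chars)]          # drop NA / empty
  chars <- chars[!grepl("^[[:alnum:][:space:]]$", chars)]
  data.frame(
    char = chars,
    codepoint = sprintf(
      "U+%04X",
      vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
    ),
    stringsAsFactors = FALSE
  )
}
```

Running `list_symbols(tab[[1]])` on the extracted table should show whether the REC1 column's white circles arrive as a different character from the ones in other columns, or not as a character at all.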

Could anyone please help me fix this? I also tried selecting the table area manually, but the output was worse. Below is the R code I am using.

I know I can complete this process by hand for all the tables in the document using PDF's built-in select and export tools but would like to automate the process.

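library("tabulizer")

# Public PDF that contains the tables
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
# Extract the tables on page 23, using ruling lines to find cell boundaries
tab <- extract_tables(f2, pages = 23, method = "lattice")
head(tab[[1]])
# extract_tables() returns a list of matrices, so convert the first one only
df <- as.data.frame(tab[[1]], stringsAsFactors = FALSE)
write.csv(df, file = "test.csv")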
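Since the goal is to process many tables rather than one, the export step could be wrapped so that every matrix in the list gets its own CSV file. The sketch below uses only base R; `write_tables`, its `dir` and `prefix` arguments, and the file naming scheme are hypothetical choices for illustration.

```r
# Write each matrix in a list of extracted tables to its own CSV file.
# Returns (invisibly) the paths that were written.
write_tables <- function(tables, dir = tempdir(), prefix = "table") {
  paths <- file.path(dir, sprintf("%s_%02d.csv", prefix, seq_along(tables)))
  for (i in seq_along(tables)) {
    df <- as.data.frame(tables[[i]], stringsAsFactors = FALSE)
    write.csv(df, file = paths[i], row.names = FALSE)
  }
  invisible(paths)
}
```

For example, `write_tables(tab, dir = ".")` would write the page 23 tables to the working directory; passing a vector of page numbers to `pages` in `extract_tables()` should yield a longer list that can be handled the same way.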

There are 0 answers
