Trouble with the tabulizer library in R recognizing non-alphanumeric (symbol) characters in a PDF table


I am using the tabulizer library in R to capture data from a table located inside a PDF on a public website (https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).

The example table that I am interested in is on page 23 of the PDF (printed page 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard layout and uses symbols (non-alphanumeric characters) in its cells. I want to extract most, if not all, of the tables in this document.

I want to end up with a table in which each symbol is replaced by a code (e.g., black circles with 999, white circles with 777, plus signs with -99).
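To make the target concrete, here is a minimal sketch of the recoding step in base R. It assumes the extracted cells contain the Unicode characters U+25CF (black circle) and U+25CB (white circle), which may not be what the extraction actually returns; `symbol_codes` and `recode_symbols` are hypothetical names used only for illustration.

```r
# Hypothetical lookup: the symbols are assumptions about what the extraction
# returns; the codes are the ones described in the question.
symbol_codes <- setNames(c(999, 777, -99), c("\u25CF", "\u25CB", "+"))

# Recode a character matrix cell by cell; cells with no match become NA.
recode_symbols <- function(m, codes = symbol_codes) {
  out <- unname(codes[match(trimws(m), names(codes))])
  dim(out) <- dim(m)  # restore the table shape lost by indexing
  out
}

# Toy input standing in for one extracted table (filled column by column).
toy <- matrix(c("\u25CF", "+", "\u25CB", ""), nrow = 2)
recode_symbols(toy)
```

Leaving unmatched cells as NA makes it easy to spot symbols that came through as something unexpected.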

For the most part, tabulizer does a good job of converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems with the white circles in the REC1 column. This is odd, since it does seem to recognize exotic characters in other columns.
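One way to check what a column such as REC1 actually contains is to print the Unicode code points of its distinct cell values. The sketch below uses only base R; `cell_code_points` is a hypothetical helper, and the toy vector is made up rather than taken from the real output.

```r
# List the Unicode code points of every distinct, non-empty cell value.
cell_code_points <- function(x) {
  vals <- unique(as.vector(x))
  vals <- vals[!is.na(vals) & nzchar(vals)]
  lapply(setNames(vals, vals), utf8ToInt)
}

# Toy column standing in for REC1; real cells may hold other characters.
rec1 <- c("\u25CB", "\u25CF", "\u25CB", "")
cell_code_points(rec1)
```

If the white circles come back as several different code points, or as empty strings, that would point to how the PDF encodes the glyph rather than to the table detection.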

Could anyone please help fix this? I also tried selecting the table area, but the output was worse. Below is the R code I am using.

I know I can complete this process by hand for all the tables in the document using a PDF viewer's built-in select and export tools, but I would like to automate the process.

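library("tabulizer")
# Chapter 2 of the Basin Plan (public PDF)
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
# Page 23 of the file is printed page 2-21; the result is a list of tables
tab <- extract_tables(f2, pages = 23, method = "lattice")
head(tab[[1]])
# Convert only the first table; passing the whole list breaks when several tables are returned
df <- as.data.frame(tab[[1]], stringsAsFactors = FALSE)
write.csv(df, file = "test.csv")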
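
For the automation part, a minimal sketch of looping the same call over several pages might look like the following. `extract_pages` is a hypothetical helper; the `extractor` argument exists only so the loop can be tried with a stand-in function, and it defaults to `tabulizer::extract_tables`.

```r
# Run an extractor over several pages and keep the results per page.
# `extractor` defaults to tabulizer::extract_tables; it is a parameter only
# so the loop can be exercised without tabulizer installed.
extract_pages <- function(file, pages, extractor = tabulizer::extract_tables) {
  out <- lapply(pages, function(p) {
    # A page that fails to parse yields NULL instead of stopping the loop.
    tryCatch(extractor(file, pages = p, method = "lattice"),
             error = function(e) NULL)
  })
  names(out) <- paste0("page_", pages)
  out
}
```

A call such as `extract_pages(f2, pages = 20:30)` would then return one list entry per page; the page range here is only an example, not the actual range of tables in the document.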
