Consistently extracting tables using R's -pdftools- package : Line breaks issue

52 views Asked by At

I'm working on extracting text, font information, and tables from PDF files using R. Currently, I'm utilizing the pdftools package for this task. However, I find the process a bit tedious, especially when it comes to extracting tables.

As far as I know, pdftools provides two main functions for text extraction:

pdf_data(): This function extracts text at the word-level and provides detailed information such as font information, font_size, coordinates, etc.

pdf_text(): This function extracts text at the line-level, which can be useful for extracting tables.

However, due to line breaks I'm encountering challenges when it comes to accurately extracting tables using pdf_text(). To address this issue, I've developed a plan to improve the process:

  1. Utilize pdf_data() to extract font information.
  2. Use pdf_text() to extract line-level information.
  3. Merge the outputs of both functions by tokenizing the lines extracted by pdf_text().

Here's a simple example of a pdf screenshot I could be working on. And here's a simplified version of the code I'm using:

library("pdftools")
library("dplyr")

# Functions of -pdftools()-
pdf_text <- pdf_text("D:/VARNAME.pdf")
pdf_data <- pdf_data("D:/VARNAME.pdf", font_info = TRUE)

# Cleaning the outputs and merging them. 
pdf_text <- as.data.frame(str_split(pdf_text, "\n", simplify = TRUE))
pdf_text <- as.data.frame(t(pdf_text))
pdf_text <- pdf_text %>% 
  mutate(id = V1) %>%
  unnest_tokens(V2, V1, to_lower = FALSE, token = "regex", pattern = "\\s+") %>%
  mutate(obs_number = row_number())

pdf_data <- bind_rows(lapply(seq_along(pdf_data), function(i) mutate(pdf_data[[i]], page = i)))
pdf_data <- pdf_data[with(pdf_data, order(pdf_data$y, pdf_data$x)), ]
pdf_data <- pdf_data %>% 
  mutate(obs_number = row_number())

merged_df <- merge(pdf_text, pdf_data, by.x = c("V2", "obs_number"), by.y = c("text", "obs_number"), sort = FALSE, all = TRUE)

Which yields the following dataframe :

structure(list(V2 = c("VARNAME_1", "Current", "description", 
"of", "the", "varname.", "It", "has", "a", "specific", "combination", 
"of", "font_name", "and", "font_size", "so", "it", "is", "easy", 
"to", "extract", "it", "despite", "it", "having", "a", "line", 
"break.", "Input", "variable", "=", "XXX", "List", "of", "codes", 
"01", "How", "to", "consistently", "identify", "category", "labels", 
"and", "to", "match", "them", "with", "cat.", "numbers", "despite", 
"the", "line", "breaks?", "02", "Easy", "to", "get", "03", "Easy", 
"to", "get", "as", "well.", "Don’t", "know", "/", "Missing."
), obs_number = 1:67, id = c("VARNAME_1", "Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "Input variable = XXX", 
"Input variable = XXX", "Input variable = XXX", "Input variable = XXX", 
"List of codes", "List of codes", "List of codes", " 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  line breaks?", 
"                                                  line breaks?", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing."
), width = c(64L, 34L, 51L, 9L, 15L, 43L, 6L, 15L, 5L, 33L, 57L, 
9L, 50L, 17L, 41L, 10L, 6L, 7L, 20L, 9L, 32L, 6L, 33L, 6L, 30L, 
5L, 16L, 29L, 23L, 35L, 5L, 17L, 15L, 9L, 26L, 11L, 20L, 9L, 
53L, 34L, 38L, 25L, 16L, 9L, 28L, 23L, 19L, 16L, 39L, 32L, 14L, 
16L, 34L, 11L, 19L, 9L, 14L, 11L, 20L, 9L, 14L, 9L, 21L, 24L, 
24L, 4L, 36L), height = c(12L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), x = c(70L, 70L, 108L, 
161L, 173L, 191L, 237L, 246L, 264L, 272L, 309L, 369L, 381L, 434L, 
453L, 497L, 510L, 70L, 80L, 103L, 115L, 150L, 159L, 195L, 204L, 
237L, 245L, 264L, 70L, 97L, 135L, 143L, 70L, 88L, 100L, 76L, 
303L, 326L, 338L, 394L, 430L, 472L, 500L, 303L, 315L, 345L, 371L, 
394L, 413L, 455L, 490L, 303L, 321L, 76L, 303L, 325L, 337L, 76L, 
303L, 325L, 337L, 354L, 366L, 303L, 330L, 357L, 363L), y = c(73L, 
96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 
96L, 96L, 96L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 
111L, 111L, 111L, 133L, 133L, 133L, 133L, 156L, 156L, 156L, 179L, 
179L, 179L, 179L, 179L, 179L, 179L, 179L, 192L, 192L, 192L, 192L, 
192L, 192L, 192L, 192L, 206L, 206L, 220L, 220L, 220L, 220L, 234L, 
234L, 234L, 234L, 234L, 234L, 247L, 247L, 247L, 247L), space = c(FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, 
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, 
FALSE, TRUE, TRUE, TRUE, FALSE), font_name = c("BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri"
), font_size = c(12, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04), page = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), class = "data.frame", row.names = c(NA, -67L))

For the variable description, I can just concatenate id values together if, for one specific combination of font_name and font_size, there are more than one id value. But I don't know how to deal with line breaks within a cell, like in cell A2 of the picture.

My goal is to manage to get this output:

# A tibble: 4 × 2
  cat_number cat_label                                                                                                
  <chr>      <chr>                                                                                                    
1 01         How to consistently identify category labels and to match them with cat. numbers despite the line breaks?
2 02         Easy to get                                                                                              
3 03         Easy to get as well.                                                                                     
4 NA         Don't know / Missing.     

Btw, I must use exclusively pdftools because I need the font information for later purposes.

0

There are 0 answers