Consistently extracting tables using R's -pdftools- package : Line breaks issue

Question

Consistently extracting tables using R's -pdftools- package : Line breaks issue

52 views Asked by Val At 19 February 2024 at 14:38

I'm working on extracting text, font information, and tables from PDF files using R. Currently, I'm utilizing the pdftools package for this task. However, I find the process a bit tedious, especially when it comes to extracting tables.

As far as I know, pdftools provides two main functions for text extraction:

pdf_data(): This function extracts text at the word-level and provides detailed information such as font information, font_size, coordinates, etc.

pdf_text(): This function extracts text at the line-level, which can be useful for extracting tables.

However, due to line breaks I'm encountering challenges when it comes to accurately extracting tables using pdf_text(). To address this issue, I've developed a plan to improve the process:

Utilize pdf_data() to extract font information.
Use pdf_text() to extract line-level information.
Merge the outputs of both functions by tokenizing the lines extracted by pdf_text().

Here's a simple example of a pdf screenshot I could be working on. And here's a simplified version of the code I'm using:

library("pdftools")
library("dplyr")

# Functions of -pdftools()-
pdf_text <- pdf_text("D:/VARNAME.pdf")
pdf_data <- pdf_data("D:/VARNAME.pdf", font_info = TRUE)

# Cleaning the outputs and merging them. 
pdf_text <- as.data.frame(str_split(pdf_text, "\n", simplify = TRUE))
pdf_text <- as.data.frame(t(pdf_text))
pdf_text <- pdf_text %>% 
  mutate(id = V1) %>%
  unnest_tokens(V2, V1, to_lower = FALSE, token = "regex", pattern = "\\s+") %>%
  mutate(obs_number = row_number())

pdf_data <- bind_rows(lapply(seq_along(pdf_data), function(i) mutate(pdf_data[[i]], page = i)))
pdf_data <- pdf_data[with(pdf_data, order(pdf_data$y, pdf_data$x)), ]
pdf_data <- pdf_data %>% 
  mutate(obs_number = row_number())

merged_df <- merge(pdf_text, pdf_data, by.x = c("V2", "obs_number"), by.y = c("text", "obs_number"), sort = FALSE, all = TRUE)

Which yields the following dataframe :

structure(list(V2 = c("VARNAME_1", "Current", "description", 
"of", "the", "varname.", "It", "has", "a", "specific", "combination", 
"of", "font_name", "and", "font_size", "so", "it", "is", "easy", 
"to", "extract", "it", "despite", "it", "having", "a", "line", 
"break.", "Input", "variable", "=", "XXX", "List", "of", "codes", 
"01", "How", "to", "consistently", "identify", "category", "labels", 
"and", "to", "match", "them", "with", "cat.", "numbers", "despite", 
"the", "line", "breaks?", "02", "Easy", "to", "get", "03", "Easy", 
"to", "get", "as", "well.", "Don’t", "know", "/", "Missing."
), obs_number = 1:67, id = c("VARNAME_1", "Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"Current description of the varname. It has a specific combination of font_name and font_size so it", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.", 
"is easy to extract it despite it having a line break.", "Input variable = XXX", 
"Input variable = XXX", "Input variable = XXX", "Input variable = XXX", 
"List of codes", "List of codes", "List of codes", " 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
" 01                                               How to consistently identify category labels and", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  to match them with cat. numbers despite the", 
"                                                  line breaks?", 
"                                                  line breaks?", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 02                                               Easy to get", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
" 03                                               Easy to get as well.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing.", 
"                                                  Don’t know / Missing."
), width = c(64L, 34L, 51L, 9L, 15L, 43L, 6L, 15L, 5L, 33L, 57L, 
9L, 50L, 17L, 41L, 10L, 6L, 7L, 20L, 9L, 32L, 6L, 33L, 6L, 30L, 
5L, 16L, 29L, 23L, 35L, 5L, 17L, 15L, 9L, 26L, 11L, 20L, 9L, 
53L, 34L, 38L, 25L, 16L, 9L, 28L, 23L, 19L, 16L, 39L, 32L, 14L, 
16L, 34L, 11L, 19L, 9L, 14L, 11L, 20L, 9L, 14L, 9L, 21L, 24L, 
24L, 4L, 36L), height = c(12L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), x = c(70L, 70L, 108L, 
161L, 173L, 191L, 237L, 246L, 264L, 272L, 309L, 369L, 381L, 434L, 
453L, 497L, 510L, 70L, 80L, 103L, 115L, 150L, 159L, 195L, 204L, 
237L, 245L, 264L, 70L, 97L, 135L, 143L, 70L, 88L, 100L, 76L, 
303L, 326L, 338L, 394L, 430L, 472L, 500L, 303L, 315L, 345L, 371L, 
394L, 413L, 455L, 490L, 303L, 321L, 76L, 303L, 325L, 337L, 76L, 
303L, 325L, 337L, 354L, 366L, 303L, 330L, 357L, 363L), y = c(73L, 
96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 
96L, 96L, 96L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 
111L, 111L, 111L, 133L, 133L, 133L, 133L, 156L, 156L, 156L, 179L, 
179L, 179L, 179L, 179L, 179L, 179L, 179L, 192L, 192L, 192L, 192L, 
192L, 192L, 192L, 192L, 206L, 206L, 220L, 220L, 220L, 220L, 234L, 
234L, 234L, 234L, 234L, 234L, 247L, 247L, 247L, 247L), space = c(FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, 
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, 
FALSE, TRUE, TRUE, TRUE, FALSE), font_name = c("BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", 
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri"
), font_size = c(12, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 
11.04, 11.04, 11.04, 11.04, 11.04, 11.04), page = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), class = "data.frame", row.names = c(NA, -67L))

For the variable description, I can just concatenate id values together if, for one specific combination of font_name and font_size, there are more than one id value. But I don't know how to deal with line breaks within a cell, like in cell A2 of the picture.

My goal is to manage to get this output:

# A tibble: 4 × 2
  cat_number cat_label                                                                                                
  <chr>      <chr>                                                                                                    
1 01         How to consistently identify category labels and to match them with cat. numbers despite the line breaks?
2 02         Easy to get                                                                                              
3 03         Easy to get as well.                                                                                     
4 NA         Don't know / Missing.

Btw, I must use exclusively pdftools because I need the font information for later purposes.

Original Q&A

TechQA.

Consistently extracting tables using R's -pdftools- package : Line breaks issue

There are 0 answers

Related Questions in R

Related Questions in PDF

Related Questions in PDFTOOLS

Popular Questions

Trending Questions