I'm working on extracting text, font information, and tables from PDF files using R. Currently, I'm utilizing the pdftools package for this task. However, I find the process a bit tedious, especially when it comes to extracting tables.
As far as I know, pdftools provides two main functions for text extraction:
pdf_data(): This function extracts text at the word-level and provides detailed information such as font information, font_size, coordinates, etc.
pdf_text(): This function extracts text at the line-level, which can be useful for extracting tables.
However, due to line breaks I'm encountering challenges when it comes to accurately extracting tables using pdf_text(). To address this issue, I've developed a plan to improve the process:
- Utilize
pdf_data()to extract font information. - Use
pdf_text()to extract line-level information. - Merge the outputs of both functions by tokenizing the lines extracted by
pdf_text().
Here's a simple example of a pdf screenshot I could be working on. And here's a simplified version of the code I'm using:
library("pdftools")
library("dplyr")
# Functions of -pdftools()-
pdf_text <- pdf_text("D:/VARNAME.pdf")
pdf_data <- pdf_data("D:/VARNAME.pdf", font_info = TRUE)
# Cleaning the outputs and merging them.
pdf_text <- as.data.frame(str_split(pdf_text, "\n", simplify = TRUE))
pdf_text <- as.data.frame(t(pdf_text))
pdf_text <- pdf_text %>%
mutate(id = V1) %>%
unnest_tokens(V2, V1, to_lower = FALSE, token = "regex", pattern = "\\s+") %>%
mutate(obs_number = row_number())
pdf_data <- bind_rows(lapply(seq_along(pdf_data), function(i) mutate(pdf_data[[i]], page = i)))
pdf_data <- pdf_data[with(pdf_data, order(pdf_data$y, pdf_data$x)), ]
pdf_data <- pdf_data %>%
mutate(obs_number = row_number())
merged_df <- merge(pdf_text, pdf_data, by.x = c("V2", "obs_number"), by.y = c("text", "obs_number"), sort = FALSE, all = TRUE)
Which yields the following dataframe :
structure(list(V2 = c("VARNAME_1", "Current", "description",
"of", "the", "varname.", "It", "has", "a", "specific", "combination",
"of", "font_name", "and", "font_size", "so", "it", "is", "easy",
"to", "extract", "it", "despite", "it", "having", "a", "line",
"break.", "Input", "variable", "=", "XXX", "List", "of", "codes",
"01", "How", "to", "consistently", "identify", "category", "labels",
"and", "to", "match", "them", "with", "cat.", "numbers", "despite",
"the", "line", "breaks?", "02", "Easy", "to", "get", "03", "Easy",
"to", "get", "as", "well.", "Don’t", "know", "/", "Missing."
), obs_number = 1:67, id = c("VARNAME_1", "Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"Current description of the varname. It has a specific combination of font_name and font_size so it",
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.",
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.",
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.",
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.",
"is easy to extract it despite it having a line break.", "is easy to extract it despite it having a line break.",
"is easy to extract it despite it having a line break.", "Input variable = XXX",
"Input variable = XXX", "Input variable = XXX", "Input variable = XXX",
"List of codes", "List of codes", "List of codes", " 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" 01 How to consistently identify category labels and",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" to match them with cat. numbers despite the",
" line breaks?",
" line breaks?",
" 02 Easy to get",
" 02 Easy to get",
" 02 Easy to get",
" 02 Easy to get",
" 03 Easy to get as well.",
" 03 Easy to get as well.",
" 03 Easy to get as well.",
" 03 Easy to get as well.",
" 03 Easy to get as well.",
" 03 Easy to get as well.",
" Don’t know / Missing.",
" Don’t know / Missing.",
" Don’t know / Missing.",
" Don’t know / Missing."
), width = c(64L, 34L, 51L, 9L, 15L, 43L, 6L, 15L, 5L, 33L, 57L,
9L, 50L, 17L, 41L, 10L, 6L, 7L, 20L, 9L, 32L, 6L, 33L, 6L, 30L,
5L, 16L, 29L, 23L, 35L, 5L, 17L, 15L, 9L, 26L, 11L, 20L, 9L,
53L, 34L, 38L, 25L, 16L, 9L, 28L, 23L, 19L, 16L, 39L, 32L, 14L,
16L, 34L, 11L, 19L, 9L, 14L, 11L, 20L, 9L, 14L, 9L, 21L, 24L,
24L, 4L, 36L), height = c(12L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L), x = c(70L, 70L, 108L,
161L, 173L, 191L, 237L, 246L, 264L, 272L, 309L, 369L, 381L, 434L,
453L, 497L, 510L, 70L, 80L, 103L, 115L, 150L, 159L, 195L, 204L,
237L, 245L, 264L, 70L, 97L, 135L, 143L, 70L, 88L, 100L, 76L,
303L, 326L, 338L, 394L, 430L, 472L, 500L, 303L, 315L, 345L, 371L,
394L, 413L, 455L, 490L, 303L, 321L, 76L, 303L, 325L, 337L, 76L,
303L, 325L, 337L, 354L, 366L, 303L, 330L, 357L, 363L), y = c(73L,
96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L, 96L,
96L, 96L, 96L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L,
111L, 111L, 111L, 133L, 133L, 133L, 133L, 156L, 156L, 156L, 179L,
179L, 179L, 179L, 179L, 179L, 179L, 179L, 192L, 192L, 192L, 192L,
192L, 192L, 192L, 192L, 206L, 206L, 220L, 220L, 220L, 220L, 234L,
234L, 234L, 234L, 234L, 234L, 247L, 247L, 247L, 247L), space = c(FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE), font_name = c("BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold", "BCDEEE+Calibri-Bold",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri",
"BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri", "BCDFEE+Calibri"
), font_size = c(12, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04, 11.04,
11.04, 11.04, 11.04, 11.04, 11.04, 11.04), page = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), class = "data.frame", row.names = c(NA, -67L))
For the variable description, I can just concatenate id values together if, for one specific combination of font_name and font_size, there are more than one id value. But I don't know how to deal with line breaks within a cell, like in cell A2 of the picture.
My goal is to manage to get this output:
# A tibble: 4 × 2
cat_number cat_label
<chr> <chr>
1 01 How to consistently identify category labels and to match them with cat. numbers despite the line breaks?
2 02 Easy to get
3 03 Easy to get as well.
4 NA Don't know / Missing.
Btw, I must use exclusively pdftools because I need the font information for later purposes.