I have this output from the pdftools pdf_data() for a page of the financial statements of a town. Unfortunately, in rare cases, the capture of a line y is slightly off, as shown below. I would like to be able to group on y including cases where y is +-1.
library(data.table)
data <-
read.csv(
text =
"x, y, text\n43, 391, Total\n66, 391, Expenditures\n260, 390, 6476803\n542, 390, 6773717"
)
data <- setDT(data)
# View data
print(data)
#> x y text
#> 1: 43 391 Total
#> 2: 66 391 Expenditures
#> 3: 260 390 6476803
#> 4: 542 390 6773717
# The problem
data[, paste(text, collapse = ""), y]
#> y V1
#> 1: 391 Total Expenditures
#> 2: 390 6476803 6773717
Desired output something like this if y <= y + 1 and y => y - 1:
#> y V1
#> 1: c(391, 390) Total Expenditures 6476803 6773717
Most attempts about grouping on a range suggest to create new columns for the hi and low, create a new variable cut() to group on, but I was unsure where to begin to implement this. I also have thousands of pages where the y's are constantly varying.
I generally use in data.table so a solution for that is much preferred.
Created on 2021-05-20 by the reprex package (v2.0.0)