How to group and aggregate a data.table based on a range of a variable in r

347 views Asked by At

I have this output from the pdftools pdf_data() for a page of the financial statements of a town. Unfortunately, in rare cases, the capture of a line y is slightly off, as shown below. I would like to be able to group on y including cases where y is +-1.

  library(data.table)
  data <- 
    read.csv(
      text =
     "x, y, text\n43, 391, Total\n66, 391, Expenditures\n260, 390, 6476803\n542, 390, 6773717"
     )
  data <- setDT(data)
  
  # View data
  print(data)
#>      x   y          text
#> 1:  43 391         Total
#> 2:  66 391  Expenditures
#> 3: 260 390       6476803
#> 4: 542 390       6773717
  
  # The problem
  data[, paste(text, collapse = ""), y]
#>      y                  V1
#> 1: 391  Total Expenditures
#> 2: 390     6476803 6773717

Desired output something like this if y <= y + 1 and y => y - 1:

#>       y                  V1
#> 1: c(391, 390)  Total Expenditures 6476803 6773717

Most attempts about grouping on a range suggest to create new columns for the hi and low, create a new variable cut() to group on, but I was unsure where to begin to implement this. I also have thousands of pages where the y's are constantly varying.

I generally use in data.table so a solution for that is much preferred.

Created on 2021-05-20 by the reprex package (v2.0.0)

0

There are 0 answers