I am working with data.tables in R. The data has multiple records per id, and I am trying to extract the nth record for each individual using data.table's .SD. If I specify the index as an integer literal, the new data.table is created almost instantaneously, but if the index is a variable N (as it might be in a function), the code takes about 700 times longer. With large data sets this is a problem. Is this a known issue, and is there any way to speed it up?
library(data.table)
library(microbenchmark)
set.seed(102938)
dd <- data.table(id = rep(1:10000, each = 10), seq = seq(1:10))
setkey(dd, id)
N <- 2
microbenchmark(dd[, .SD[2], keyby = id],
               dd[, .SD[N], keyby = id],
               times = 5)
#> Unit: microseconds
#>                      expr        min         lq       mean     median         uq        max neval
#>  dd[, .SD[2], keyby = id]    886.269   1584.513   2904.497   1851.356   1997.134   8203.214     5
#>  dd[, .SD[N], keyby = id] 770822.875 810131.784 870418.622 903956.708 912223.026 954958.718     5
It may be better to do the subsetting with the row index (.I) instead of .SD; a benchmark sketch is below.
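A minimal sketch of that comparison, reusing dd and N from the question (the inner call returns the global row number of the nth record per id in a column that data.table names V1, and those row numbers are then used to subset dd; exact timings will vary by machine):

# row-index approach: compute the row number of the nth record per id, then subset dd
microbenchmark(dd[dd[, .I[2], keyby = id]$V1],
               dd[dd[, .I[N], keyby = id]$V1],
               times = 5)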
With .I the timings improve a lot compared with .SD, but there is still a performance hit for the variable case, part of which is the search time in the global environment for finding the variable 'N'. Internally, query optimization also plays a role in the timings. We can switch all optimizations off with options(datatable.optimize = 0L) and compare again.
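A sketch of re-running the comparison with optimization switched off (timings omitted, as they depend on the machine):

options(datatable.optimize = 0L)  # turn off all internal query optimization
microbenchmark(dd[, .SD[2], keyby = id],
               dd[, .SD[N], keyby = id],
               dd[dd[, .I[2], keyby = id]$V1],
               dd[dd[, .I[N], keyby = id]$V1],
               times = 5)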
Now, the .I method is the faster one. Changing the option to 1 brings back the base-level optimizations; a sketch follows.
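Again only a sketch of the same comparison at this level:

options(datatable.optimize = 1L)  # base-level optimizations only, GForce still off
microbenchmark(dd[, .SD[2], keyby = id],
               dd[, .SD[N], keyby = id],
               times = 5)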
With 2, GForce optimization is turned on as well, which matches the default behaviour; a sketch follows.
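Sketch at level 2:

options(datatable.optimize = 2L)  # GForce enabled, as in the default setting
microbenchmark(dd[, .SD[2], keyby = id],
               dd[, .SD[N], keyby = id],
               times = 5)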
The behind-the-scenes optimizations can be checked with verbose = TRUE.
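For example, running both queries with verbose = TRUE prints how (or whether) j was internally optimized:

dd[, .SD[2], keyby = id, verbose = TRUE]  # literal index: check which optimizations are reported for j
dd[, .SD[N], keyby = id, verbose = TRUE]  # variable index: compare the reported optimizations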