Suppose we have a huge data.frame, data (60000 x 8000), which has been converted to a data.table by setDT(data). This conversion works by reference rather than making a copy of data, which is great (a scaled-down sketch of this setup follows my code below). Now I want to keep only a subset of data, for example the first 40000 rows:
id <- rep(FALSE, nrow(data))   # logical row index
id[1:40000] <- TRUE            # keep only the first 40000 rows
data <- subset(data, id)       # replace data with the selected rows
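For reference, a scaled-down version of my setup could look like the sketch below; the column count and random values are only stand-ins for the real 60000 x 8000 table.

library(data.table)

# scaled-down stand-in for the real 60000 x 8000 data.frame
data <- as.data.frame(matrix(rnorm(60000 * 10), nrow = 60000))

setDT(data)   # converts to a data.table in place, by reference; no copy is made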
However, the subset() call above doesn't perform well: it makes a copy of data, and I need to call gc() manually to release the memory. In my example, an additional 1.6 GB was consumed by subset(), all of which could be released by gc(). I have read some of the data.table documentation and may have missed something important, but the examples I have found focus on extracting a subset of a data.table and assigning it to a new variable, rather than updating the original one. For example:
new.data <- subset(data, id)
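For what it's worth, this is roughly how I observed the extra 1.6 GB; gc(reset = TRUE) is only used here to get a clean baseline, and the exact figure will of course depend on the column types.

gc(reset = TRUE)           # clean baseline before subsetting

data <- subset(data, id)   # the temporary copy inflates peak memory usage

gc()                       # reports the peak and releases the extra ~1.6 GB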
Thanks a lot.