When training regression models with the text package in R, the model's size grows with the number of training data points, resulting in unnecessarily large model objects. The models are created with the parsnip package using the glmnet engine. Because R avoids duplicating data in memory, components of the model can share the same underlying data, which makes it difficult to tell which components/attributes actually take up space. For instance, object_size(model) reports 700 MB, but object_size(model$final_recipe) and object_size(model$final_model) each report nearly the same size, 698 MB, so measuring components individually does not reveal how much space each one actually accounts for.
- How can I efficiently identify and remove the memory-heavy components of the model to reduce its size, while maintaining its predictive ability?
Example:

```r
object_size(model)              # 700 MB
object_size(model$final_recipe) # 698 MB
object_size(model$final_model)  # 698 MB
```
When removing the final_recipe attribute (just as an example, not something I would do in practice):

```r
model$final_recipe <- NULL
```

the size of the model is still 700 MB:

```r
object_size(model) # 700 MB
```
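Measuring the components together illustrates the sharing. A small sketch, assuming object_size() here is pryr::object_size(), which accepts multiple objects and counts shared memory only once (the printed sizes are illustrative):

```r
library(pryr)

# Each component measured on its own
object_size(model$final_recipe)                    # ~698 MB
object_size(model$final_model)                     # ~698 MB

# Measured together: shared memory is counted only once. A combined size
# near 698 MB (rather than ~1.4 GB) indicates the two components reference
# largely the same underlying data/environments, which is why deleting one
# of them barely changes the reported size of the whole model object.
object_size(model$final_recipe, model$final_model) # ~698 MB (illustrative)
```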
If you only care about the predictions, you can extract the coefficients as a matrix and predict with a matrix multiplication. The example below shows that the result is the same as predict(model, data).
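A minimal sketch of that approach, assuming model$final_model is a fitted workflow with a glmnet engine, that the penalty value the model was tuned to is known (0.1 is a placeholder below), and that data has already been preprocessed into the same predictor columns the glmnet fit was trained on; the extraction steps may need adjusting to your object's actual structure:

```r
library(workflows) # extract_fit_parsnip()
library(glmnet)    # coef() method for glmnet objects

# Pull the underlying parsnip fit and glmnet object out of the fitted
# workflow (assumed structure; adjust if final_model is already a parsnip fit).
parsnip_fit <- extract_fit_parsnip(model$final_model)
glmnet_fit  <- parsnip_fit$fit

# Coefficients at the chosen penalty (placeholder value); coef() on a glmnet
# object returns a sparse matrix with the intercept in the first row.
penalty <- 0.1
betas   <- as.matrix(coef(glmnet_fit, s = penalty))

# Design matrix of preprocessed predictors, with columns aligned to the
# coefficient names. Assumes `data` already contains these columns.
x <- as.matrix(data[, rownames(betas)[-1], drop = FALSE])

# Manual prediction: intercept + X %*% beta
manual_pred <- as.numeric(betas[1, 1] + x %*% betas[-1, 1, drop = FALSE])

# Compare with the model's own prediction; the returned format depends on the
# model class, so coerce to a plain numeric vector before comparing.
all.equal(manual_pred, as.numeric(unlist(predict(model, data))))
```

With this approach, only the coefficient matrix (plus whatever preprocessing produces the predictor columns) needs to be kept, which is what keeps the stored object small.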