Saving Random Forest Classifiers (sklearn) with pickle/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is that the saved files get huge. This seems to be correlated to the amount of data that I use for training (which is several tens of millions of samples per forest, leading to dumped files on the order of up to 20 GB!).

Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.

Thanks for your help! Baradrist

Here's what I did in a nutshell:

I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress option set to 3).

import pickle
import joblib
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = ... some data

classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)

# option 1: plain pickle
with open(path + 'classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

or

# option 2: joblib (also tried compress=3)
joblib.dump(classifier, path + 'classifier.joblib', compress=True)
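
For what it's worth, joblib's compress argument also accepts an integer level from 0 to 9. Here is a minimal sketch for comparing the resulting file sizes (the 'classifier_c3.joblib' file name is just an illustrative choice; classifier and path are from the snippet above):

import os
import joblib

# higher compression level: slower to write, usually smaller on disk
joblib.dump(classifier, path + 'classifier_c3.joblib', compress=3)

for name in ('classifier.joblib', 'classifier_c3.joblib'):
    size_gb = os.path.getsize(path + name) / 1e9
    print(f'{name}: {size_gb:.2f} GB')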

Since the saved files got quite big (5 GB to nearly 20 GB, compressed to approx. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data (a sketch of that experiment is below). Depending on the size of the training set, I found different sizes for the saved classifier, making me believe that information about the training data is pickled/joblibed as well. This seems unintuitive to me: for predictions, I only need the trained weak predictors (decision trees), whose size should be fixed, and since the number of trees and the max depth are not too high, they should not take up that much space - and certainly not more space because of a larger training set.
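
To illustrate the kind of experiment described above, here is a minimal sketch (the subset sizes are only illustrative, and len(pickle.dumps(...)) is used as a quick proxy for the on-disk size):

import pickle
from sklearn.ensemble import RandomForestClassifier

for n in (100_000, 1_000_000, 10_000_000):
    # fit on a subset of the data and measure the serialized size
    clf = RandomForestClassifier(n_estimators=24, max_depth=10)
    clf.fit(X_train[:n], y_train[:n])
    print(f'{n} samples -> {len(pickle.dumps(clf)) / 1e6:.1f} MB serialized')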

All in all, I suspect that the structure contains more than I need. Yet, I couldn't find a good answer on how to exclude these parts and save only the information necessary for my future predictions.

1 Answer

Answered by pietroppeter:

I ran into a similar issue and I also thought in the beginning that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out that decision trees are indeed memory-hungry structures that consist of multiple arrays whose length is given by the total number of nodes. The number of nodes generally grows with the size of the data (and parameters like max_depth cannot effectively be used to limit that growth, since reasonable values still leave room for a huge number of nodes). See details in this answer, but the gist is (a small sketch for checking node counts on your own forest follows the list):

  • a single decision tree can easily grow to a few MB (the example in that answer has a 5 MB decision tree for 100K samples and a 50 MB decision tree for 1M samples)
  • a random forest commonly contains at least 100 such decision trees, so for the example above you would get models in the range of 0.5 to 5 GB
  • compression is usually not enough to reduce the files to a reasonable size (compression ratios of 1/2 to 1/3 are typical)
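
As a minimal sketch of how to check this on your own fitted forest (estimators_ and tree_.node_count are part of scikit-learn's tree API; the per-tree pickle size is only a rough proxy for memory use, and classifier is the fitted forest from the question):

import pickle

# total number of nodes across all trees in the forest
n_nodes = sum(est.tree_.node_count for est in classifier.estimators_)
print(f'total nodes across all trees: {n_nodes}')

# rough serialized size per tree and for the whole forest
per_tree = [len(pickle.dumps(est)) for est in classifier.estimators_]
print(f'forest: {sum(per_tree) / 1e6:.1f} MB, largest tree: {max(per_tree) / 1e6:.1f} MB')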

Other notes:

  • with a different algorithm, models might remain a more manageable size (e.g. with xgboost I saw much smaller serialized models)
  • it is probably possible to "prune" some of the data held by the decision trees if you only plan to reuse them for prediction. In particular, I imagine the impurity array and possibly the n_samples-related arrays might not be needed, but I have not checked.
  • with respect to your hypothesis that the random forest is saving the data it was trained on: note that it is not, and the data itself would likely be one or more orders of magnitude smaller than the final model
  • another strategy, if you have a reproducible training pipeline, could be to save the data instead of the model and retrain when needed (a small sketch follows this list), but this is only possible if you can spare the time to retrain (for example in a use case where you have a long-running service which keeps the model in memory and you serialize it only as a backup for when the service goes down)
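
Here is a minimal sketch of the save-the-data-and-retrain strategy (the file name, the rebuild_model helper and the fixed random_state are illustrative assumptions to make the retraining reproducible):

import joblib
from sklearn.ensemble import RandomForestClassifier

# save the (much smaller) training data instead of the fitted forest
joblib.dump((X_train, y_train), path + 'train_data.joblib', compress=3)

def rebuild_model(data_path):
    # reload the data and retrain with the same, fixed hyperparameters
    X, y = joblib.load(data_path)
    clf = RandomForestClassifier(n_estimators=24, max_depth=10, random_state=0)
    return clf.fit(X, y)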

There are probably also other options to limit the growth of a random forest; the best one I have found so far is in this answer, where the suggestion is to set min_samples_leaf as a percentage of the data.
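
For reference, scikit-learn accepts a float for min_samples_leaf and interprets it as a fraction of the training samples (ceil(fraction * n_samples) per leaf); the 0.0005 value below is only an illustrative starting point:

from sklearn.ensemble import RandomForestClassifier

# a float min_samples_leaf is treated as a fraction of n_samples,
# which caps the number of leaves (and hence nodes) per tree
classifier = RandomForestClassifier(n_estimators=24, min_samples_leaf=0.0005)
classifier.fit(X_train, y_train)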