As you know, the Isolation Forest model in scikit-learn has a `bootstrap` parameter. Its description reads as follows:
If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
I made a simple dataset and trained an Isolation Forest model, but the evaluation results differed considerably depending on whether `bootstrap` was `True` or `False`. Please refer to the code below.
```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(0)

# making train and test data: 10 inliers in [0, 1] plus one obvious outlier
size = 10
train_x = np.concatenate((np.random.uniform(0, 1, size=(size, 1)), np.array([[100]])), axis=0)
train_y = [1] * size + [-1]
test_x = np.concatenate((np.random.uniform(0, 1, size=(size, 1)), np.array([[102]])), axis=0)
test_y = train_y.copy()

# defining accuracy: fraction of predictions that match the labels
def accuracy(y_true, y_pred):
    return sum(1 for i in range(len(y_true)) if y_true[i] == y_pred[i]) / len(y_true)

# when bootstrap = True
iso = IsolationForest(n_estimators=100, max_samples=4, max_features=1.0, bootstrap=True, random_state=0)
iso.fit(train_x)
predicted_y = iso.predict(test_x)
print(accuracy(test_y, predicted_y))  # 0.8182

# when bootstrap = False
iso = IsolationForest(n_estimators=100, max_samples=4, max_features=1.0, bootstrap=False, random_state=0)
iso.fit(train_x)
predicted_y = iso.predict(test_x)
print(accuracy(test_y, predicted_y))  # 1.0
```
My questions are:
- What is the role of the `bootstrap` parameter in Isolation Forest?
- By what criteria should the `bootstrap` parameter be chosen in Isolation Forest?
Please let me know when to select `True` and when to select `False`.
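In a Random Forest, setting `bootstrap` to `False` means every tree is fit on the entire training dataset, so the trees differ only through the randomness in feature and split selection. The premise of Random Forest style models is that a bootstrap sample (i.e. drawn with replacement) is taken from the dataset for each tree, and this lets the ensemble generalise much better than a single decision tree can.

`IsolationForest` is a different case: each tree is still fit on its own random subset of `max_samples` points whichever value you choose, and `bootstrap` only decides whether that subset is drawn with or without replacement, as the documentation quoted in the question states.

Long story short, `bootstrap=True` is what makes a Random Forest a proper forest, but in an Isolation Forest the trees are not identical with `bootstrap=False`, so it does not have to be `True` there.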
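One way to check what `bootstrap` actually changes is to look at which training rows each tree was fit on. The following is a minimal sketch, not a definitive analysis: it rebuilds toy data similar to the question's (the helper name `drawn_samples` is hypothetical) and reads the fitted forest's `estimators_samples_` attribute, which holds the row indices drawn for each tree.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Toy data mirroring the question: 10 inliers in [0, 1] plus one outlier at 100.
X = np.concatenate((rng.uniform(0, 1, size=(10, 1)), np.array([[100.0]])), axis=0)


def drawn_samples(bootstrap):
    """Fit a forest and return the row indices each tree was trained on."""
    forest = IsolationForest(
        n_estimators=100, max_samples=4, bootstrap=bootstrap, random_state=0
    ).fit(X)
    return forest.estimators_samples_


no_boot = drawn_samples(bootstrap=False)
boot = drawn_samples(bootstrap=True)

# Without replacement: count how many different 4-row subsets the trees saw.
distinct_subsets = len({tuple(sorted(idx)) for idx in no_boot})

# With replacement: count trees whose draw contains the same row more than once.
trees_with_repeats = sum(len(np.unique(idx)) < len(idx) for idx in boot)

print("distinct subsets with bootstrap=False:", distinct_subsets)
print("trees with repeated rows with bootstrap=True:", trees_with_repeats)
```

If the trees see many different subsets with `bootstrap=False`, they are not identical. If many trees contain repeated rows with `bootstrap=True`, those trees are built from fewer than `max_samples` unique points, which matters a lot when `max_samples` is as small as 4.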