I am trying to build an outlier detector to find outliers in test data. That data varies a bit (more test channels, longer testing).
First I'm applying the train/test split, because I wanted to use grid search on the training data to get the best results. This is time-series data from multiple sensors, and I removed the time column beforehand.
X shape : (25433, 17)
y shape : (25433, 1)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=0)
I standardize afterwards, and then I converted the arrays to integers because GridSearchCV doesn't seem to like continuous data. This can surely be done better, but I want it to work before I optimize the code.
import numpy as np
from sklearn.preprocessing import StandardScaler

# X: fit the scaler on the training split only, then transform both splits
scaler_X = StandardScaler().fit(X_train)
X_train = scaler_X.transform(X_train)
X_test = scaler_X.transform(X_test)
# keep two decimals of precision, then convert to int
# (np.rint avoids the float truncation that round(...)*100 can cause)
X_train = np.rint(X_train * 100).astype(int)
X_test = np.rint(X_test * 100).astype(int)
# y: same procedure with its own scaler
scaler_y = StandardScaler().fit(y_train)
y_train = scaler_y.transform(y_train)
y_test = scaler_y.transform(y_test)
y_train = np.rint(y_train * 100).astype(int)
y_test = np.rint(y_test * 100).astype(int)
I chose the IsolationForest because it's fast, has pretty good results, and can handle huge data sets (I currently only use a chunk of the data for testing). SVM might also be an option I want to check out. Then I set up the GridSearchCV:
from pyod.models.iforest import IForest

clf = IForest(random_state=47, behaviour='new', n_jobs=-1)

param_grid = {'n_estimators': [20, 40, 70, 100],
              'max_samples': [10, 20, 40, 60],
              'contamination': [0.1, 0.01, 0.001],
              'max_features': [5, 15, 30],
              'bootstrap': [True, False]}
from sklearn.metrics import fbeta_score, make_scorer

fbeta = make_scorer(fbeta_score,
                    average='micro',
                    needs_proba=True,
                    beta=1)
from sklearn import model_selection

grid_estimator = model_selection.GridSearchCV(clf,
                                              param_grid,
                                              scoring=fbeta,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)
grid_estimator.fit(X_train, y_train)
The Problem:
GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run it, I get the following error, which I don't understand:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.

Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
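For reference, a minimal sketch of that pattern (the bandwidth grid here is just an illustrative choice, not taken from your data):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# KernelDensity.score returns the total log-likelihood of the data,
# so GridSearchCV can rank candidates without any labels
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X_train)  # unsupervised: no y needed
print(grid.best_params_)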
In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring parameter. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited for the data science or statistics stackexchange sites.
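To show only the mechanics, here's a sketch of such a custom scorer for pyod's IForest. The metric itself (the negated mean raw outlier score) is a placeholder of my own, not a recommendation, and the tiny grid is illustrative:

from pyod.models.iforest import IForest
from sklearn.model_selection import GridSearchCV

def mean_outlier_score(estimator, X, y=None):
    # pyod's decision_function returns raw outlier scores
    # (higher = more abnormal); GridSearchCV maximizes the returned
    # value, so we negate the mean. Replace this placeholder with a
    # metric that is meaningful for your data.
    return -estimator.decision_function(X).mean()

search = GridSearchCV(IForest(random_state=47),
                      {'n_estimators': [50, 100]},  # illustrative grid
                      scoring=mean_outlier_score,
                      cv=5)
search.fit(X_train)  # no y argument needed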