I am dealing with a very sparse and imbalanced dataset. I reduce its dimensionality with PCA and then feed the result into a KNN classifier. I cannot use SMOTE or the imblearn package in this case, and simple upsampling hasn't helped much, so I'm not looking to correct the imbalance right now. I am also more interested in the output of my model's .predict_proba() than in its actual predictions.
When I fit the model with scoring='roc_auc', the outputs of knn.predict_proba(test_X) are continuous float values between 0 and 1, just as I'd expect. However, when I set scoring='recall' in an effort to predict the minority class better, the outputs of knn.predict_proba(test_X) are all either 0.00001 or 1.00000, and I don't understand why. If I do the same with a gradient boosted decision tree classifier that I built for the same problem, it still outputs continuous values when I use scoring='roc_auc'.
My best guess is that there just aren't enough cases of the minority class in this dataset, and that it partly has something to do with KNN being a topological method.
I want to understand what could be making this happen.
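For context on the guess about KNN above: with weights='uniform', KNeighborsClassifier.predict_proba reports the fraction of the k nearest neighbors that belong to each class, so the set of values it can return depends on n_neighbors. The following is a minimal sketch on synthetic data; the make_classification dataset and the split are assumptions standing in for the dataset described above, not the real one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (hypothetical stand-in, roughly 5% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
train_X, test_X = X[:400], X[400:]
train_y = y[:400]

unique_probs = {}
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn.fit(train_X, train_y)
    # Column 1 holds the predicted probability of the minority class (label 1).
    unique_probs[k] = np.unique(knn.predict_proba(test_X)[:, 1])
    print(k, unique_probs[k])
```

With n_neighbors=1 the only possible values are 0 and 1, while n_neighbors=15 can only produce multiples of 1/15. Whether this is what happens on the real data depends on which parameters the grid search actually selects.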
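from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# PCA for dimensionality reduction, followed by the KNN classifier
pca = PCA()
knn = KNeighborsClassifier()
pipe = Pipeline(steps=[('pca', pca), ('knn', knn)])

# Define parameters
param_grid = {
    'pca__n_components': [4, 7, 10, 20, 82],
    'knn__n_neighbors': [1, 3, 5, 7, 15],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=10, verbose=3, scoring='recall')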
======================================================================================
best_params_: {'pca__n_components': 82,
'knn__n_neighbors': 15,
'knn__weights': 'uniform'}
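One way to narrow this down is to run the same search under both scorers and compare the selected parameters with the distinct values that predict_proba returns. This is a sketch only: the data is synthetic, the grid is a reduced, hypothetical version of the one above so that it runs quickly, and the results on the real dataset may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data (hypothetical stand-in, roughly 10% positives).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = Pipeline(steps=[('pca', PCA()), ('knn', KNeighborsClassifier())])
# Reduced grid (assumption) to keep the sketch fast.
param_grid = {
    'pca__n_components': [4, 10],
    'knn__n_neighbors': [1, 5, 15],
    'knn__weights': ['uniform', 'distance'],
}

results = {}
for scoring in ('roc_auc', 'recall'):
    search = GridSearchCV(pipe, param_grid, cv=5, scoring=scoring)
    search.fit(train_X, train_y)
    # GridSearchCV refits the best pipeline, so predict_proba uses best_params_.
    probs = search.predict_proba(test_X)[:, 1]
    results[scoring] = (search.best_params_, np.unique(probs))
    print(scoring, search.best_params_,
          'distinct probabilities:', len(results[scoring][1]))
```

Comparing the two printed lines shows whether the change in scoring leads the search to a different n_neighbors or weights setting, which in turn determines how many distinct probability values the refit model can produce.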