Why does setting scoring='recall' make KNN's predict_proba output binary on an imbalanced dataset?


I am dealing with a very sparse, imbalanced dataset. I reduce its dimensionality with PCA and then feed the result into a KNN classifier. I cannot use SMOTE or the imblearn package in this case, and simple upsampling hasn't helped much, so I'm not looking to correct the imbalance right now. I am also more interested in the output of my model's .predict_proba() than in its actual predictions.

When I fit the model with scoring='roc_auc', the outputs of knn.predict_proba(test_X) are continuous float values between 0 and 1, just as I'd expect. However, when I set scoring='recall' in an effort to predict the minority class better, the outputs of knn.predict_proba(test_X) are all either 0.00001 or 1.00000, and I don't understand why. If I do the same with a gradient boosted decision tree classifier that I built for the same problem, it still outputs continuous values when I use scoring='roc_auc'.

My best guess is that there just aren't enough cases of the minority class in this dataset, and that it partly has something to do with KNN being a topological method.
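One thing that may be worth checking alongside that guess is how the number of neighbors shapes what predict_proba can return. The sketch below is only an illustration on synthetic data from make_classification (an assumption, not the dataset in question); the helper name minority_probas is hypothetical. It collects the distinct minority-class probabilities KNN produces for n_neighbors=1 versus n_neighbors=15 with uniform weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (an assumption; NOT the dataset from the question).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)


def minority_probas(n_neighbors, weights):
    """Return the distinct minority-class probabilities KNN assigns to X_test."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    knn.fit(X_train, y_train)
    return np.unique(knn.predict_proba(X_test)[:, 1])


# With a single neighbor, the "probability" is just that neighbor's label.
probas_k1 = minority_probas(1, "uniform")
# With 15 uniformly weighted neighbors, it is the fraction of neighbors in the class.
probas_k15 = minority_probas(15, "uniform")

print("k=1 :", probas_k1)
print("k=15:", probas_k15)
```

If the grid search ends up selecting a very small n_neighbors under one scoring metric and a larger one under another, a difference of this kind could show up in the probabilities, so comparing best_params_ between the two runs seems like a reasonable first check.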

I want to understand what could be making this happen.

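from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Reduce dimensionality with PCA, then classify with KNN
pca = PCA()
knn = KNeighborsClassifier()
pipe = Pipeline(steps=[('pca', pca), ('knn', knn)])
# Define parameters
param_grid = {
    'pca__n_components': [4, 7, 10, 20, 82],
    'knn__n_neighbors': [1, 3, 5, 7, 15],
    'knn__weights': ['uniform', 'distance'],
}
# Select the best combination by 10-fold cross-validated recall
search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=10, verbose=3, scoring='recall')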
======================================================================================
best_params_: {'pca__n_components': 82, 
               'knn__n_neighbors': 15, 
               'knn__weights': 'uniform'}
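To see how the two metrics rank the same parameter combinations, a single multi-metric search could be run instead of two separate ones. This is a minimal sketch under stated assumptions: it uses synthetic data from make_classification and a smaller grid than the one above, so its numbers say nothing about the real dataset. Passing a dict to scoring and a metric name to refit are existing GridSearchCV options.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data (an assumption; NOT the dataset from the question).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)

pipe = Pipeline(steps=[("pca", PCA()), ("knn", KNeighborsClassifier())])
# Reduced grid for illustration only.
param_grid = {
    "pca__n_components": [4, 10],
    "knn__n_neighbors": [1, 5, 15],
    "knn__weights": ["uniform", "distance"],
}

# Score every candidate on both metrics; refit the final model on recall.
search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring={"recall": "recall", "roc_auc": "roc_auc"},
    refit="recall",
)
search.fit(X, y)

# Print each candidate with its mean cross-validated score under both metrics.
results = search.cv_results_
for params, rec, auc in zip(
    results["params"], results["mean_test_recall"], results["mean_test_roc_auc"]
):
    print(params, f"recall={rec:.3f}", f"roc_auc={auc:.3f}")

print("selected by recall:", search.best_params_)
```

Reading the two score columns side by side shows whether recall and ROC AUC prefer different values of knn__n_neighbors or knn__weights, which would change what predict_proba can return.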