I'm using a diabetes dataset whose target variable has 3 classes. I trained a Decision Tree Classifier, tuned its hyperparameters with RandomizedSearchCV from the scikit-learn package, and fitted the model to the training data. I then computed the predicted probabilities for the test data, which give the probability of each of the 3 classes for every sample. Now I want to calculate a cutoff value that I can use to assign the classes, and I'm using the F1 score to find an appropriate cutoff.
I'm stuck on how to compute the F1 score in this setting. Will the F1 score metric help me find the cutoff?
Here is the dataset
After preprocessing the data, I split it into training and testing sets.
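from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

dtree = DecisionTreeClassifier()
params = {'class_weight': [None, 'balanced'],
          'criterion': ['entropy', 'gini'],
          'max_depth': [None, 5, 10, 15, 20, 30, 50, 70],
          'min_samples_leaf': [1, 2, 5, 10, 15, 20],
          'min_samples_split': [2, 5, 10, 15, 20]}
grid_search = RandomizedSearchCV(dtree, cv=10, n_jobs=-1, n_iter=10, scoring='roc_auc_ovr', verbose=20, param_distributions=params)
grid_search.fit(X_train, y_train)
# Best model found by the search; with the default refit=True it is already fitted on X_train
mdl = grid_search.best_estimator_
# One row per test sample, one column per class
test_score = mdl.predict_proba(X_test)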
This is the code I wrote to find the cutoff for a binary classifier:
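import numpy as np

cutoffs = np.linspace(0.01, 0.99, 99)
true = np.asarray(y_train)
# Probability of the positive class (column 1) on the training data
train_score = mdl.predict_proba(X_train)[:, 1]
F1_all = []
for cutoff in cutoffs:
    pred = (train_score > cutoff).astype(int)
    TP = ((pred == 1) & (true == 1)).sum()
    FP = ((pred == 1) & (true == 0)).sum()
    TN = ((pred == 0) & (true == 0)).sum()
    FN = ((pred == 0) & (true == 1)).sum()
    F1 = TP / (TP + 0.5 * (FP + FN))
    F1_all.append(F1)
# F1_all is a plain list, so use argmax rather than a boolean mask to pick the best cutoff
my_cutoff = cutoffs[np.argmax(F1_all)]
# Apply the cutoff to the positive-class column of the test probabilities
preds = (test_score[:, 1] > my_cutoff).astype(int)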
There is no cutoff value for the normalized (softmax-style) output of a multiclass classifier in the same sense as the cutoff value for a binary classifier.
When your output is normalized probabilities for multiple classes and you want to convert this into class labels, you just take the label with the highest assigned probability.
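A minimal sketch of that argmax rule, assuming `proba` is shaped like the output of `predict_proba` (one row per sample, one column per class) and `classes` stands in for the fitted model's `classes_` attribute. The numbers are made up for illustration.

```python
import numpy as np

# Made-up probabilities for 4 samples and 3 classes (each row sums to 1).
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
    [0.40, 0.35, 0.25],
])
# Stand-in for mdl.classes_: column j of proba belongs to classes[j].
classes = np.array([0, 1, 2])

# For every row, pick the class whose column holds the highest probability.
labels = classes[np.argmax(proba, axis=1)]
print(labels)
```

With a fitted scikit-learn classifier, `mdl.predict(X_test)` should give the same labels, so you normally don't need to do this by hand.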
Technically you could design a custom scheme such as "if class1 has a probability of 10% or more, choose the class1 label; otherwise pick the class with the highest assigned probability", which would be a sort of cutoff for class 1. This is rather arbitrary, though, and I have not seen anyone do it in practice. If you have some deep insight into your problem suggesting that something like this may be useful, then go ahead and build your own "cutoff" formula; otherwise you should stick with the general approach (argmax of the normalized probabilities).
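If you do want to try such a rule, here is a rough sketch of what it could look like. The function name `predict_with_class_cutoff`, its parameters, and the probabilities are all hypothetical and only illustrate the idea; this is not a standard scikit-learn feature.

```python
import numpy as np

def predict_with_class_cutoff(proba, favoured_col, cutoff):
    """Choose column `favoured_col` whenever its probability reaches `cutoff`;
    otherwise fall back to the column with the highest probability."""
    proba = np.asarray(proba)
    labels = np.argmax(proba, axis=1)                 # default: plain argmax
    labels[proba[:, favoured_col] >= cutoff] = favoured_col  # override rule
    return labels

# Made-up probabilities for 4 samples and 3 classes (each row sums to 1).
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.12, 0.28, 0.60],
    [0.45, 0.50, 0.05],
    [0.05, 0.35, 0.60],
])

# Favour the first class as soon as it reaches 10%.
labels = predict_with_class_cutoff(proba, favoured_col=0, cutoff=0.10)
print(labels)
```

To judge whether a rule like this helps at all, you could compare its macro-averaged F1 score (for example `sklearn.metrics.f1_score` with `average='macro'`) against plain argmax on held-out data, not on the training set.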