ROC Curve is convex


I am doing a ROC plot (and AUC calculation) of default frequencies, using logistic regression with a single multi-class feature, 'sub_grade'. Assume lcd is a dataframe containing the initial data.

# Imports (assuming scikit-learn, with sklearn.linear_model imported as lm)
import sklearn.linear_model as lm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Assign only sub_grade as a feature, Default as response
# (double brackets keep X two-dimensional, as scikit-learn expects)
X = lcd[['sub_grade']]
y = lcd['Default']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50, random_state=123)

logreg = lm.LogisticRegression()
logreg.fit(Xtrain, ytrain)

# Get classification probabilities for the positive class from log reg
y_probas = logreg.predict_proba(Xtest)[:, 1]
# Generate ROC curve from ytest and y_probas
fpr, tpr, thresholds = roc_curve(ytest, y_probas)

The resulting ROC curve is convex, and the AUC score is ~0.35. Why is this? I thought ROC curves rank the classifications by predicted probability. This outcome would imply that the classes with the highest percentage of defaults have the lowest predicted probability of defaulting.

Am I interpreting this correctly?


There are 2 answers

GPB:

Update: the issue lay with how I was using the LogisticRegression classifier. The coefficient changes sign if the order of the feature's categories is reversed, which inverts the predicted probabilities. I must not have understood this bit.
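A minimal sketch of this effect, using synthetic data (the encoding scheme and numbers here are hypothetical, not from the question): if an ordinal category is encoded in the opposite order, the fitted coefficient simply flips sign, and with it the direction of the predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical ordinal encoding of sub_grade: 0 = best grade, 6 = worst
grades = rng.integers(0, 7, size=1000)
# Default probability rises with the encoded grade
y = (rng.random(1000) < 0.05 + 0.1 * grades).astype(int)

# Fit once with the encoding as-is, once with the category order reversed
fwd = LogisticRegression().fit(grades.reshape(-1, 1), y)
rev = LogisticRegression().fit((6 - grades).reshape(-1, 1), y)

print(fwd.coef_[0, 0])  # positive: higher grade code -> higher default odds
print(rev.coef_[0, 0])  # negative: same model, encoding reversed
```

So the model itself is unchanged; only the sign of the coefficient (and hence the direction of the probability ranking) depends on the category order.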

dukebody:

A ROC-AUC score lower than 0.5 means that your classifier is predicting worse than random, i.e. the pattern it learns from the training data is the opposite of the one later found in the test data.

This seldom happens, and can be corrected easily by predicting 1 - current_probability instead.
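A toy illustration of that fix (the numbers are made up for the example): a classifier that learned the pattern backwards scores below 0.5, and flipping its probabilities mirrors the AUC around 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# Scores from a classifier that learned the pattern backwards:
# every negative example outranks every positive one
p = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.3])

print(roc_auc_score(y_true, p))      # 0.0 (perfectly wrong)
print(roc_auc_score(y_true, 1 - p))  # 1.0 (perfectly right)
```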

Reasons why this might be happening:

  • The training and the test data patterns differ heavily, or there is no real global pattern.
  • Your model is overfitting pretty hard.

In your case, since you are using only one feature, overfitting due to too many parameters is unlikely. My guess is that there is no real correlation between your feature and your target, so you are fitting only noise.
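This can be sketched with fully synthetic data (the dataset below is made up to illustrate the point, it is not the asker's data): when the feature carries no signal, the test-set AUC hovers around 0.5, and on any single train/test split it can easily land below it, as the asker observed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 1))          # feature with no relation to the target
y = rng.integers(0, 2, size=200)  # random binary labels

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=123)
probs = LogisticRegression().fit(Xtr, ytr).predict_proba(Xte)[:, 1]
auc = roc_auc_score(yte, probs)
print(auc)  # near 0.5; a single split can fall noticeably below it
```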