I am doing an ROC plot (and AUC calculation) of default frequencies, using logistic regression with a single multi-class feature, 'sub_grade'. Assume lcd is a DataFrame containing the initial data.
from sklearn import linear_model as lm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Assign only sub_grade as a feature, Default as response
# (sub_grade is assumed to be numerically/ordinally encoded already)
X = lcd[['sub_grade']]  # double brackets keep X two-dimensional for scikit-learn
y = lcd['Default']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50, random_state=123)
logreg = lm.LogisticRegression()
logreg.fit(Xtrain, ytrain)
# Get the predicted probability of the positive class (Default = 1)
y_probas = logreg.predict_proba(Xtest)[:, 1]
# Generate ROC Curve from ytest and y_probas
fpr, tpr, thresholds = roc_curve(ytest, y_probas)
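For what it's worth, an AUC this far below 0.5 means the scores rank the observations backwards: scoring with 1 - y_probas mirrors the curve and yields 1 - AUC. A minimal sketch on made-up toy data (the arrays here are invented for illustration, not from lcd):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up scores that rank backwards: the defaults (y = 1) get LOW scores
ytest = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_probas = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.4])

fpr, tpr, _ = roc_curve(ytest, y_probas)
auc_inverted = auc(fpr, tpr)  # far below 0.5: ranking is inverted

# Flipping the scores mirrors the curve across the diagonal
fpr_f, tpr_f, _ = roc_curve(ytest, 1 - y_probas)
auc_flipped = auc(fpr_f, tpr_f)  # equals 1 - auc_inverted
```

This is why an AUC near 0.35 is a red flag for a sign or encoding issue rather than a genuinely uninformative feature: an uninformative feature would sit near 0.5.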
The resulting ROC curve is convex (it bows below the diagonal), and the AUC score is ~0.35. Why is this? I thought the ROC curve sweeps a threshold over the predicted probabilities, ranking observations from most to least likely to default. An AUC below 0.5 would imply that the sub-grades with the highest percentage of defaults are being assigned the lowest predicted probabilities of default.
Am I interpreting this correctly?
Update: the issue lay in how I was using the lm classifier. The coefficient changes sign when the ordering of the feature's encoded classes is reversed. I must not understand this part yet.
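Regarding the update: with a single ordinally encoded feature, reversing the encoding is expected to flip the coefficient's sign, but it should not by itself change the fitted probabilities or the AUC, since the intercept absorbs the shift. A runnable sketch on synthetic data (the grade codes and default rates below are made up; actual sub_grade values would be encoded analogously):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical ordinal sub-grade codes 0..6; default rate rises with the code
grade = rng.integers(0, 7, size=2000)
y = rng.binomial(1, 0.05 + 0.10 * grade)

# Fit on the original encoding
clf_fwd = LogisticRegression().fit(grade.reshape(-1, 1), y)
auc_fwd = roc_auc_score(y, clf_fwd.predict_proba(grade.reshape(-1, 1))[:, 1])

# Fit on the reversed encoding (0..6 becomes 6..0)
rev = (6 - grade).reshape(-1, 1)
clf_rev = LogisticRegression().fit(rev, y)
auc_rev = roc_auc_score(y, clf_rev.predict_proba(rev)[:, 1])

print(clf_fwd.coef_[0, 0], clf_rev.coef_[0, 0])  # opposite signs
print(auc_fwd, auc_rev)                          # essentially equal
```

So the sign flip alone is harmless. An AUC of ~0.35 instead suggests the encoding assigned to sub_grade does not track default rate monotonically (e.g. an arbitrary label encoding), or the positive class in y_probas is not the one intended.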