I have the following dataframe:
new_df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
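To make the sample reproducible, here is a minimal sketch that builds the six rows shown above as a pandas DataFrame. The real data is much larger; these rows are only the excerpt from the table.

```python
import pandas as pd

# Rows copied from the sample table above; the full dataset has many more rows.
new_df = pd.DataFrame({
    'BankNum': ['0098-7772', '0098-7772', '0098-7772',
                '0870-7771', '0870-7771', '0098-2123'],
    'ID': ['AB123', 'ED245', 'ED343', 'ED200', 'ED100', 'GH564'],
    'Labels': ['High', 'High', 'High', 'Mod', 'Mod', 'Low'],
})

# Quick look at how the labels are distributed in the sample.
print(new_df['Labels'].value_counts())
```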
I am using scikit-learn's SVC to predict the Labels 'High', 'Mod', and 'Low'. I'm doing it as follows:
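import numpy as np
from sklearn import metrics, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

# Strip the dash so BankNum can be cast to a number
new_df['BankNum'] = new_df['BankNum'].map(lambda x: x.replace('-', ''))
new_df['BankNum'] = new_df.BankNum.astype(np.float128)
columns = ['BankNum', 'ID']
# Encode the ID strings and the target labels as integers
le = LabelEncoder()
new_df['ID'] = le.fit_transform(new_df.ID)
new_df['Labels'] = le.fit_transform(new_df.Labels)
X_train, X_test, y_train, y_test = train_test_split(new_df[columns], new_df.Labels, test_size=0.2, random_state=42)
clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=42)
# 8-fold cross-validation on the training split
scores = cross_val_score(clf, X_train, y_train, cv=8)
print("Cross Validation Score: ")
print(scores.mean())
# Fit on the training split and evaluate on the held-out split
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("Accuracy: ")
print(np.mean(predicted == y_test))
print(metrics.classification_report(y_test, predicted))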
I have two questions:
1.) For the classification report I'm getting an output like this:
             precision    recall  f1-score   support
          0       0.00      0.00      0.00      4780
          1       0.94      1.00      0.97    104719
          2       0.00      0.00      0.00      1425
avg / total       0.89      0.94      0.92    110924
Why do labels 0 and 2 get 0.00 precision? Could this be because of class imbalance? There are about 80893 High labels, 11798 Mod labels, and 279608 Low labels. Or is SVM not a good model for this?
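One thing I noticed: the numbers look like what I would get if the model predicted label 1 for every test row. Below is a rough sanity check of that assumption, using only the support counts from the report above. It is just arithmetic, not a diagnosis of the model.

```python
# Support counts taken from the classification report above.
support = {0: 4780, 1: 104719, 2: 1425}
total = sum(support.values())

# Assumption: the classifier predicts label 1 for every test row.
precision_1 = support[1] / total   # true label-1 rows / all rows predicted as 1
recall_1 = 1.0                     # every label-1 row is predicted as 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Labels 0 and 2 are never predicted, so they contribute 0 to the averages.
weighted_precision = precision_1 * support[1] / total
weighted_recall = recall_1 * support[1] / total
weighted_f1 = f1_1 * support[1] / total

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(weighted_precision, 2), round(weighted_recall, 2), round(weighted_f1, 2))
```

The rounded values match the label 1 row and the avg / total row of the report, so the report at least seems consistent with every prediction being label 1.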
2.) I want to get a confidence score for each prediction. I searched and found the following:
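# Assumption: AUC is roc_auc_score under an alias; the snippet I found does not show the import
from sklearn.metrics import roc_auc_score as AUC
p = clf.predict_proba(X_test)
auc = AUC(y_test, p[:, 1])
print("SVM AUC", auc)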
But I'm getting this error:
raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported
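To show what I mean by a per-prediction confidence, here is a minimal sketch on made-up three-class data (not my real data). As far as I can tell, predict_proba returns one column per class, so p[:, 1] is only the probability of label 1, and the value I am after is probably the largest probability in each row. The names confidence and most_probable are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic three-class data, only to illustrate the shapes involved.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
clf = SVC(probability=True, random_state=42).fit(X, y)

p = clf.predict_proba(X)                        # shape (n_samples, 3): one column per class
confidence = p.max(axis=1)                      # largest class probability in each row
most_probable = clf.classes_[p.argmax(axis=1)]  # label with the largest probability

print(p.shape)
print(confidence[:5])
print(most_probable[:5])
```

I understand from the scikit-learn docs that for SVC the label with the highest predict_proba value may not always agree with predict, so I am not sure this is the right way to read it.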
How do I get a confidence measure for each prediction and then interpret it as well? Many thanks!!