Matlab -- SVM -- All Majority Class Predictions with Same Score and AUC = .50

I'm having a strange problem training an SVM with an RBF kernel in Matlab. When I run a grid search over the C and sigma values using 10-fold cross-validation, I always get AUC values of approximately .50 (varying between .48 and .54). I obtained these from: [X,Y,T,AUC] = perfcurve(dSet1Label(test),label, 1); where dSet1Label(test) are the actual test set labels and label are the predicted labels. The classifier only predicts the majority class, which constitutes just over 90% of the data.

Upon further investigation, the scores (obtained from [label,score] = predict(svmStruct, dSet1(test,:)); where svmStruct is a model trained on 9/10ths of the data and dSet1(test,:) is the remaining 1/10th) are all the same:

0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
0.8323   -0.8323
  .         .
  .         .
  .         .
0.8323   -0.8323

The data consists of 443 features and 6,453 instances, 542 of which are of the positive class. The features have been scaled to a range of [0,1], per standard SVM protocol. The classes are represented by {-1,1}.

My code is as follows:

load('datafile.m');
boxVals = [1,2,5,10,20,50,100,200,500,1000];
rbfVals = [.0001,.01,.1,1,2,3,5,10,20];
[m,n] = size(dSet1);
[c,v] = size(boxVals);
[e,r] = size(rbfVals);
auc_holder = [];
accuracy_holder = [];
for i = 1:v
     curBox = boxVals(i)
     for j = 1:r
         curRBF = rbfVals(j)
         valInd = crossvalind('Kfold', m, 10);
         temp_auc = [];
         temp_acc = [];
         cp = classperf(dSet1Label);
         for k = 1:10
             test = (valInd==k); train = ~test;
             svmStruct = fitcsvm(dSet1(train,:), dSet1Label(train), 'KernelFunction', 'rbf', 'BoxConstraint', curBox, 'KernelScale', curRBF);
             [label,score] = predict(svmStruct, dSet1(test,:));
             accuracy = sum(dSet1Label(test) == label)/numel(dSet1Label(test));
             [X,Y,T,AUC] = perfcurve(dSet1Label(test),label, 1);
             temp_auc = [temp_auc AUC];
             temp_acc = [temp_acc accuracy];
         end
         avg_auc = mean(temp_auc);
         avg_acc = mean(temp_acc);
         auc_holder = [auc_holder avg_auc];
         accuracy_holder = [accuracy_holder avg_acc];
     end
end

Thanks!

*Edit 1: It appears that, no matter what I set the box constraint to, all data points are considered support vectors.

1 Answer

Trisoloriansunscreen (best answer):

Unless you have an implementation bug (test your code with synthetic, well-separated data), the problem might lie in the class imbalance. This can be addressed by adjusting the misclassification cost (see this discussion on CV). I'd use the Cost parameter of fitcsvm to make the misclassification cost of the minority class 9 times larger than that of the majority class and see whether the problem persists. One more issue to consider is class stratification (see the crossvalind documentation: you have to pass a group parameter so that each fold has a similar class proportion).
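A minimal sketch of these suggestions, run on synthetic, well-separated data rather than the asker's dataset. Several things here are assumptions and not part of the original post: the variable names (X, y, costMatrix, foldAuc), the 'auto' kernel scale, and the use of cvpartition as a stand-in for crossvalind's group argument (cvpartition stratifies by class when given the labels, and ships in the same toolbox as fitcsvm). The sketch also passes the positive-class score to perfcurve instead of the predicted labels, since an ROC curve needs a continuous ranking.

```matlab
% Sketch only: synthetic, well-separated data stands in for dSet1/dSet1Label.
rng(0);                                   % reproducible data and folds
nNeg = 900; nPos = 100;                   % roughly the 9:1 imbalance in the question
X = [randn(nNeg, 5); randn(nPos, 5) + 2]; % positives shifted so the classes separate
y = [-ones(nNeg, 1); ones(nPos, 1)];      % labels in {-1, 1}

% Stratified 10-fold partition: passing the labels keeps the class
% proportions similar in every fold.
cv = cvpartition(y, 'KFold', 10);

% Cost(i,j) is the cost of predicting class j when the true class is i,
% with rows and columns ordered as in ClassNames. Here, missing a positive
% costs 9 times as much as missing a negative.
classNames = [-1 1];
costMatrix = [0 1; 9 0];

foldAuc = zeros(cv.NumTestSets, 1);
for k = 1:cv.NumTestSets
    trainIdx = training(cv, k);
    testIdx  = test(cv, k);
    mdl = fitcsvm(X(trainIdx, :), y(trainIdx), ...
        'KernelFunction', 'rbf', 'KernelScale', 'auto', ...  % 'auto' is an assumption
        'ClassNames', classNames, 'Cost', costMatrix);
    [~, score] = predict(mdl, X(testIdx, :));
    % Rank by the positive-class score (column 2 matches ClassNames(2) = 1),
    % not by the hard predicted labels.
    [~, ~, ~, foldAuc(k)] = perfcurve(y(testIdx), score(:, 2), 1);
end
meanAuc = mean(foldAuc);
fprintf('Mean AUC over %d folds: %.3f\n', cv.NumTestSets, meanAuc);
```

If this runs cleanly on separable data but the real data still yields identical scores for every test point, the grid of BoxConstraint and KernelScale values is probably the next thing to look at.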