I'm having a weird problem training an SVM with an RBF kernel in Matlab. When doing a grid search over the C and sigma values with 10-fold cross-validation, I always get AUC values of approximately .50 (varying between .48 and .54 depending on the parameters). I obtain these from:

[X,Y,T,AUC] = perfcurve(dSet1Label(test), label, 1);

where dSet1Label(test) contains the actual test-set labels and label contains the predicted labels. The classifier only ever predicts the majority class, which makes up just over 90% of the data.
Upon further investigation, when I look at the scores, obtained from

[label,score] = predict(svmStruct, dSet1(test,:));

where svmStruct is a model trained on 9/10ths of the data and dSet1(test,:) is the remaining 1/10th, they are all identical:
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
0.8323 -0.8323
...
0.8323 -0.8323
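As an aside, perfcurve can also take the continuous classifier scores instead of the hard predicted labels, which gives a proper ROC curve rather than a single operating point; a minimal sketch (assuming the second score column corresponds to the +1 class, per the ordering in svmStruct.ClassNames):

% Use the positive-class score rather than the predicted label for the ROC/AUC.
[label, score] = predict(svmStruct, dSet1(test,:));
[X, Y, T, AUC] = perfcurve(dSet1Label(test), score(:,2), 1);

That doesn't change the underlying problem here, though, since the scores themselves are all the same.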
The data consists of 443 features and 6,453 instances, 542 of which belong to the positive class. The features have been scaled to the range [0,1], per standard SVM practice. The classes are represented by {-1,1}.
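For reference, that scaling is just a per-feature min-max mapping; a minimal sketch of one way to do it (the actual preprocessing code isn't shown in this post, so this is purely illustrative):

% Illustrative min-max scaling of each feature (column) of dSet1 to [0,1].
% Assumes dSet1 is an instances-by-features double matrix.
colMin   = min(dSet1, [], 1);
colRange = max(dSet1, [], 1) - colMin;
colRange(colRange == 0) = 1;            % guard against constant features
dSet1 = (dSet1 - colMin) ./ colRange;   % implicit expansion (R2016b+)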
My code is as follows:
load('datafile.m');
% Grids of BoxConstraint (C) and KernelScale (sigma) values to search over
boxVals = [1,2,5,10,20,50,100,200,500,1000];
rbfVals = [.0001,.01,.1,1,2,3,5,10,20];
[m,n] = size(dSet1);
[c,v] = size(boxVals);
[e,r] = size(rbfVals);
auc_holder = [];
accuracy_holder = [];
for i = 1:v
    curBox = boxVals(i)
    for j = 1:r
        curRBF = rbfVals(j)
        % 10-fold cross-validation indices
        valInd = crossvalind('Kfold', m, 10);
        temp_auc = [];
        temp_acc = [];
        cp = classperf(dSet1Label);
        for k = 1:10
            test = (valInd==k); train = ~test;
            svmStruct = fitcsvm(dSet1(train,:), dSet1Label(train), 'KernelFunction', 'rbf', 'BoxConstraint', curBox, 'KernelScale', curRBF);
            [label,score] = predict(svmStruct, dSet1(test,:));
            accuracy = sum(dSet1Label(test) == label)/numel(dSet1Label(test));
            [X,Y,T,AUC] = perfcurve(dSet1Label(test), label, 1);
            temp_auc = [temp_auc AUC];
            temp_acc = [temp_acc accuracy];
        end
        avg_auc = mean(temp_auc);
        avg_acc = mean(temp_acc);
        auc_holder = [auc_holder avg_auc];
        accuracy_holder = [accuracy_holder avg_acc];
    end
end
Thanks!
Edit 1: It appears that, no matter what I set the box constraint to, all of the data points are considered support vectors.
Unless you have some implementation bug (test your code with synthetic, well-separated data), the problem might lie in the class imbalance. This can be addressed by adjusting the misclassification cost (see this discussion on Cross Validated). I'd use the Cost parameter of fitcsvm to make the misclassification cost of the minority class 9 times larger than that of the majority class and see whether the problem persists. One more issue to consider is class stratification (see the crossvalind documentation; you have to pass a group parameter so that each fold has a similar class proportion).
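For concreteness, here's a minimal sketch of both suggestions, reusing the variable names from your loop (the 9:1 cost ratio is just a starting point matched to your class imbalance, not a tuned value):

% Stratified 10-fold indices: passing the label vector (instead of m)
% makes crossvalind keep the class proportions similar in every fold.
valInd = crossvalind('Kfold', dSet1Label, 10);

% Cost(i,j) = cost of predicting class j when the true class is i,
% with rows/columns ordered as in ClassNames = [-1 1].
% Here a false negative on the +1 (minority) class costs 9x a false positive.
costMat = [0 1;
           9 0];

svmStruct = fitcsvm(dSet1(train,:), dSet1Label(train), ...
    'KernelFunction', 'rbf', 'BoxConstraint', curBox, ...
    'KernelScale', curRBF, ...
    'ClassNames', [-1 1], 'Cost', costMat);

As I understand it, fitcsvm folds the cost matrix into the class priors/observation weights, so the classifier can no longer minimize its objective by simply collapsing onto the majority class; re-run the grid search with this and check whether the scores stop being constant.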