I am quite new to LIBLINEAR/LIBSVM and I have quite a problem here.
I have very large training data (2,883,584 samples, highly unbalanced, each of them 21-dimensional) and also large testing data (262,144 samples, also 21-dimensional). I'm using the linear kernel implementation of LIBSVM (or LIBLINEAR) because of the size of my data; the literature warns about the issues of using RBF kernels with data like this.
My problem is: no matter what I do, the classifier only predicts one class (the class with more samples, which is the negative class in my experiments).
So far I have tried:
1- Training on balanced and on unbalanced data, with no scaling and no parameter selection.
2- Training on balanced and on unbalanced data, scaling the data to different ranges ([-1,1] and [0,1]), but with no parameter selection (the scaling step is sketched just after this list).
3- Training on balanced and on unbalanced data, scaling the data to different ranges ([-1,1] and [0,1]), with parameter selection.
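The [-1,1] scaling is essentially the following (a minimal Matlab sketch, not my exact code; the min/max are computed on the training set only and reused for the test set, and the variable names are just illustrative):

fmin = min(trainFeatures, [], 1);                 % per-feature minimum over the training set
fmax = max(trainFeatures, [], 1);                 % per-feature maximum over the training set
frange = fmax - fmin;
frange(frange == 0) = 1;                          % constant features would otherwise divide by zero
scaleTo = @(X) 2 * bsxfun(@rdivide, bsxfun(@minus, X, fmin), frange) - 1;
trainScaled = sparse(scaleTo(trainFeatures));     % LIBLINEAR's train() expects a sparse instance matrix
testScaled  = sparse(scaleTo(testFeatures));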
All of these experiments result in 81% accuracy, but all of the correct predictions belong to the negative class; every positive sample is misclassified by the linear SVM.
The .model file looks very strange, as you can see below:
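You can see this from the per-class recall rather than the overall accuracy (a minimal sketch; predictedLabels is the output of predict/svmpredict and testLabels are the true labels):

posIdx = (testLabels == 1);
negIdx = (testLabels == -1);
fprintf('positive recall: %.2f%%\n', 100 * mean(predictedLabels(posIdx) == 1));   % close to 0%
fprintf('negative recall: %.2f%%\n', 100 * mean(predictedLabels(negIdx) == -1));  % close to 100%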
solver_type L2R_L2LOSS_SVC_DUAL
nr_class 2
label 1 -1
nr_feature 21
bias -1
w
0
0
nan
nan
0
0
0
0
0
nan
nan
0
0
0
0
0
nan
nan
0
0
0
When I do the parameter selection via grid search, the best C always gives me a 5-fold cross-validation accuracy of 50%. This is how I do the grid search in Matlab:
bestcv = 0;                                % best cross-validation accuracy seen so far
for log2c = 1:100
    cmd = ['-v 5 -c ', num2str(2^log2c)];  % 5-fold CV with C = 2^log2c
    cv = train(label, inst, cmd);          % with -v, train() returns the CV accuracy
    if (cv >= bestcv)
        bestcv = cv; bestc = 2^log2c;
    end
    fprintf('%g %g (best c=%g, rate=%g)\n', log2c, cv, bestc, bestcv);
end
EDIT: Here are one positive and one negative sample from my training data:
1 1:4.896000e+01 2:3.374349e+01 3:2.519652e-01 4:1.289031e+00 5:48 6:4.021792e-01 7:136 8:4.069388e+01 9:2.669129e+01 10:-3.017949e-02 11:3.096163e+00 12:36 13:3.322866e-01 14:136 15:4.003704e+01 16:2.168262e+01 17:1.101631e+00 18:3.496498e+00 19:36 20:2.285381e-01 21:136
-1 1:5.040000e+01 2:3.251025e+01 3:2.260981e-01 4:2.523418e+00 5:48 6:4.021792e-01 7:136 8:4.122449e+01 9:2.680350e+01 10:5.681589e-01 11:3.273471e+00 12:36 13:3.322866e-01 14:136 15:4.027160e+01 16:2.245051e+01 17:6.281671e-01 18:2.977574e+00 19:36 20:2.285381e-01 21:136
And here are one positive and one negative sample from my testing data:
1 1:71 2:2.562365e+01 3:3.154359e-01 4:1.728250e+00 5:76 6:0 7:121 8:7.067857e+01 9:3.185273e+01 10:-8.272995e-01 11:2.193058e+00 12:74 13:0 14:121 15:6.675556e+01 16:3.624485e+01 17:-1.863971e-01 18:1.382679e+00 19:76 20:3.533593e-01 21:128
-1 1:5.606667e+01 2:2.480630e+01 3:1.291811e-01 4:1.477127e+00 5:65 6:0 7:76 8:5.610714e+01 9:3.602092e+01 10:-9.018124e-01 11:2.236301e+00 12:67 13:4.912373e-01 14:128 15:5.886667e+01 16:3.891050e+01 17:-5.167622e-01 18:1.527146e+00 19:69 20:3.533593e-01 21:128
Is there something wrong with my data? Should I increase the C range in the grid search? Or should I use another classifier?
In the unbalanced case, the costs of false-positive and false-negative errors are not the same, so the penalties for the positive and negative classes should be different. You may need to choose separate weights, C+ and C-, for the two classes. If you have more negative patterns than positive patterns, then you probably want to make C+ larger than C- (replace C+ and C- below with numeric values):
model = svmtrain(trainLabels, trainFeatures, '-h 0 -b 1 -s 0 -t 0 -c 10 -w1 C+ -w-1 C-');
Usually
C+ * N+ = C- * N-
where N+ and N- are the numbers of positive and negative samples, respectively. Also make sure you choose the correct options. In your case, where the number of training samples is much larger than the number of features, a linear kernel is the best option, as you said in your post.
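For example, here is a minimal sketch of filling in the C+ and C- placeholders from the class counts (the weights below are just one reasonable choice, keeping the base C from the command above):

Npos = sum(trainLabels == 1);                 % N+: number of positive training samples
Nneg = sum(trainLabels == -1);                % N-: number of negative training samples
wPos = Nneg / Npos;                           % weight for class +1, so that C+ * N+ is about C- * N-
wNeg = 1;                                     % weight for class -1
opts = sprintf('-h 0 -b 1 -s 0 -t 0 -c 10 -w1 %g -w-1 %g', wPos, wNeg);
model = svmtrain(trainLabels, trainFeatures, opts);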