I'm trying to do binary classification with the flink-ml SVM implementation. When I evaluated the classifier, I got an error rate of about 85% on the training dataset. I plotted the 3D data, and it looks like a hyperplane should separate it quite well.
When I tried to get the weight vector out of the SVM, I only found a way to get the weights without the intercept of the hyperplane, so the hyperplane always passes through (0,0,0).
I have no idea where the error could be and would appreciate any hints.
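import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.Splitter
val env = ExecutionEnvironment.getExecutionEnvironment
val input: DataSet[(Int, Int, Boolean, Double, Double, Double)] = env.readCsvFile(filepathTraining, ignoreFirstLine = true, fieldDelimiter = ";")
// the Boolean column becomes the label (+1.0 / -1.0), the three Double columns become the features
val inputLV = input.map(t => LabeledVector(if (t._3) 1.0 else -1.0, DenseVector(Array(t._4, t._5, t._6))))
// 80% of the data for training, 20% for testing
val trainTestDataSet = Splitter.trainTestSplit(inputLV, 0.8, precise = true, seed = 100)
val trainLV = trainTestDataSet.training
val testLV = trainTestDataSet.testing
val svm = SVM()
svm.fit(trainLV)
// evaluate expects (vector, true label) pairs and returns (true label, prediction) pairs
val testVD = testLV.map(lv => (lv.vector, lv.label))
val evalSet = svm.evaluate(testVD)
// groups the data in false negatives, false positives, true negatives, true positives
evalSet.map(t => (t._1, t._2, 1)).groupBy(0, 1).reduce((x1, x2) => (x1._1, x1._2, x1._3 + x2._3)).print()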
The plotted data is shown here:
The SVM classifier doesn't give you the distance to the origin (a.k.a. bias or threshold) because that is a parameter of the predictor. Different threshold values result in different precision and recall, and the optimum is use-case specific. Usually an ROC (Receiver Operating Characteristic) curve is used to find it.
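To make the role of the threshold concrete, here is a minimal sketch in plain Scala (no Flink involved). The object and function names are hypothetical, and it assumes the predictor labels a point positive when the raw decision value w·x is strictly greater than the threshold, which is how I understand Flink's SVM to behave.

```scala
object ThresholdDemo {
  // Dot product of two equally sized vectors.
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Raw decision value: w.x for a hyperplane through the origin.
  def decision(w: Array[Double], x: Array[Double]): Double = dot(w, x)

  // +1.0 if the decision value exceeds the threshold, -1.0 otherwise.
  // Assumption: strict comparison.
  def classify(w: Array[Double], threshold: Double, x: Array[Double]): Double =
    if (decision(w, x) > threshold) 1.0 else -1.0
}
```

With weights (1, 0, 0) and the point (2, 0, 0), a threshold of 0.0 gives +1.0 while a threshold of 3.0 gives -1.0. Moving the threshold is the same as shifting the hyperplane away from the origin, which is why the trained weights alone carry no intercept.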
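The related properties on SVM are (from the Flink docs): ThresholdValue, the limit above which the decision function yields a positive label, and OutputDecisionFunction, which you set to true to output the distance to the separating plane instead of the binary classification.
ROC Curve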
How to find the optimum threshold is an art in itself. Without knowing anything more about the problem, what you can always do is plot the ROC curve (the True Positive Rate against the False Positive Rate) for different values of the threshold and look for the point with the greatest distance from a random guess (the diagonal, where the True Positive Rate equals the False Positive Rate). Ultimately, though, the choice of threshold also depends on the cost of a false positive versus the cost of a false negative in your domain. Wikipedia has an example ROC plot comparing three different classifiers.
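The search for the ROC point furthest from a random guess can be sketched in plain Scala as follows. This is only an illustration: the names are hypothetical, the input is assumed to be (raw decision value, true label) pairs already collected from the classifier, and "furthest from a random guess" is taken to mean the largest TPR minus FPR, which is proportional to the distance from the diagonal.

```scala
object RocSketch {
  // One point of the ROC curve for a given threshold.
  final case class RocPoint(threshold: Double, tpr: Double, fpr: Double)

  // scored: (raw decision value, true label in {+1.0, -1.0}).
  def rocPoint(scored: Seq[(Double, Double)], threshold: Double): RocPoint = {
    val positives = scored.filter(_._2 > 0)
    val negatives = scored.filter(_._2 <= 0)
    // A sample is predicted positive when its score exceeds the threshold.
    val tp = positives.count(_._1 > threshold)
    val fp = negatives.count(_._1 > threshold)
    val tpr = if (positives.isEmpty) 0.0 else tp.toDouble / positives.size
    val fpr = if (negatives.isEmpty) 0.0 else fp.toDouble / negatives.size
    RocPoint(threshold, tpr, fpr)
  }

  // Candidate with the largest TPR - FPR; candidates must be non-empty.
  def bestThreshold(scored: Seq[(Double, Double)], candidates: Seq[Double]): RocPoint =
    candidates.map(rocPoint(scored, _)).maxBy(p => p.tpr - p.fpr)
}
```

This treats false positives and false negatives as equally costly. If they are not in your domain, replace the `tpr - fpr` criterion with a cost-weighted one.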
To choose the initial threshold, you could average the raw decision function output over the training data (or a sample of it), and then vary the threshold in a loop, measuring the TPR and FPR on the test data.
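A rough sketch of the averaging step with Flink ML might look like this. It is not the original snippet: the helper `meanDecisionValue` is a hypothetical name, it assumes the `setOutputDecisionFunction` and `setThreshold` setters described in the Flink ML SVM docs, and it assumes the training data (or sample) is small enough to collect to the driver.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector

object InitialThreshold {
  // Mean raw decision value of an already fitted SVM over the given data.
  // Returns NaN if the data set is empty.
  def meanDecisionValue(svm: SVM, train: DataSet[LabeledVector]): Double = {
    // Ask the predictor for distances instead of +1.0 / -1.0 labels.
    svm.setOutputDecisionFunction(true)
    val features: DataSet[Vector] = train.map(_.vector)
    val predictions: DataSet[(Vector, Double)] = svm.predict(features)
    val scores = predictions.map(_._2).collect()
    // Switch back to binary labels for later predictions.
    svm.setOutputDecisionFunction(false)
    scores.sum / scores.size
  }
}
```

You would then call something like `svm.setThreshold(InitialThreshold.meanDecisionValue(svm, trainLV))`, re-run `evaluate` on the test data, and repeat with thresholds slightly above and below that starting value.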
Other Hyperparameters
Note that the SVM trainer also has parameters (called hyperparameters) that need to be tuned for optimal prediction performance. There are many techniques for doing that, too many to list in this post; I just wanted to bring them to your attention. If you're feeling lazy, here's a link on Wikipedia: Hyperparameter optimization.
Other Dimensions?
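There is (somewhat of) a hack if you don't want to deal with the threshold right now. You can jam the bias into another dimension of the feature vector by appending a constant component (for example 1.0) to every vector, so that the weight learned for that extra component acts as the bias.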
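The idea of folding the bias into an extra dimension can be shown in plain Scala. The names here are hypothetical; in the question's code the equivalent change would be building the features as `DenseVector(Array(t._4, t._5, t._6, 1.0))`.

```scala
object BiasTrick {
  // Append a constant 1.0 so the last weight plays the role of the bias.
  def augment(x: Array[Double]): Array[Double] = x :+ 1.0

  // Dot product of two equally sized vectors.
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
}
```

For weights (1, 2, 3) and bias -4, the augmented weight vector is (1, 2, 3, -4). For the point (1, 1, 1), the augmented dot product equals w·x + b, so a hyperplane through the origin in four dimensions behaves like one with an intercept in three.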
Here is a nice discussion on why you should NOT do this. Basically the problem is that the bias would participate in regularization. But in machine learning there are no absolute truths.