I am new to machine learning and Spark MLlib. I have created a RandomForest classifier model using RandomForest.trainClassifier(). My training data set is mostly categorical in nature and has Actionable/NonActionable as the response/target variable. I have created a predictionAndLabels RDD using the test data and model.predict(). Now I am trying the following to validate my model accuracy.
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
System.out.println(metrics.precision());       // prints 0.94334140435
System.out.println(metrics.confusionMatrix()); // prints the following
1948.0 0.0
117.0 0.0
Now, the model accuracy printed by the precision() method seems really good, around 94%, but the confusion matrix shows something is wrong: I have 1948 NonActionable target variables and 117 Actionable target variables in the test data set. So according to the confusion matrix, the model predicts NonActionable correctly but cannot predict Actionable variables at all. I am trying to understand the confusion matrix and why precision is 94%, because the results look contradictory.
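To make the numbers concrete, here is how I read the matrix. My assumptions: rows are actual labels and columns are predicted labels, ordered by label ascending (0.0 = NonActionable, 1.0 = Actionable), and precision() with no argument is the micro-averaged precision over all predictions, which equals overall accuracy:

```java
public class ConfusionCheck {
    public static void main(String[] args) {
        // Confusion matrix as printed by metrics.confusionMatrix().
        // Assumption: rows = actual label, columns = predicted label,
        // ordered 0.0 (NonActionable), 1.0 (Actionable).
        double[][] cm = {
            {1948.0, 0.0},   // actual NonActionable: all predicted NonActionable
            { 117.0, 0.0}    // actual Actionable:    all predicted NonActionable
        };

        double total = cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1];
        // Overall accuracy = correct predictions (the diagonal) / all predictions.
        double accuracy = (cm[0][0] + cm[1][1]) / total;
        System.out.println("accuracy = " + accuracy);   // 1948 / 2065 ≈ 0.9433

        // Recall for the Actionable class: true positives / actual positives.
        double actionableRecall = cm[1][1] / (cm[1][0] + cm[1][1]);
        System.out.println("Actionable recall = " + actionableRecall);   // 0.0
    }
}
```

So the two numbers do not contradict each other: the model never predicts Actionable, yet still gets 1948 of 2065 rows right, which is exactly the 94% figure.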
Imagine your 117 Actionable rows are "glued" to about 500 NonActionable ones. The classifier can either move all 617 to the Actionable column and get 500 NonActionable rows wrong, or move them all to the NonActionable column and get only 117 wrong. Unless you tell it that the 117 Actionable errors are more costly than the 500 NonActionable errors, it will do the latter. Figure out how to balance the problem (oversample Actionable items, subsample NonActionable ones, weight Actionable items more heavily, etc.) AND work on more features to weaken the "glue" (make Actionable and NonActionable look as different as possible to the classifier).
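To illustrate the subsampling option, here is a minimal sketch in plain Java that downsamples the majority class to the minority count (the 1948/117 counts are from your question; with Spark you would do the equivalent with sample() on the NonActionable RDD before training, the shuffle-and-truncate below is just the in-memory analogue):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
    // Downsample the majority class so both classes have equal counts.
    static <T> List<T> balance(List<T> majority, List<T> minority, long seed) {
        List<T> sampled = new ArrayList<>(majority);
        Collections.shuffle(sampled, new Random(seed));            // random subset
        List<T> out = new ArrayList<>(sampled.subList(0, minority.size()));
        out.addAll(minority);                                      // keep every minority row
        return out;
    }

    public static void main(String[] args) {
        // Stand-in rows: 1948 NonActionable, 117 Actionable (counts from the question).
        List<String> nonActionable = Collections.nCopies(1948, "NonActionable");
        List<String> actionable = Collections.nCopies(117, "Actionable");

        List<String> balanced = balance(nonActionable, actionable, 42L);
        System.out.println(balanced.size());   // 234 rows: 117 of each class
    }
}
```

After retraining on the balanced set, watch per-class recall from the confusion matrix rather than overall accuracy, since accuracy alone hides exactly the failure you are seeing.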