I'm new to Weka and I have a problem with my classification project.
I have a training set with 1000 instances and a test set with 200. The problem is that when I evaluate the performance of some algorithms (such as RandomForest), the results given by cross-validation and by the test set are really different.
Here is an example with cross-validation:
=== Run information ===
Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation: testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 1000
Attributes: 276
[list of attributes omitted]
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269
Time taken to build model: 4.9 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 740 74 %
Incorrectly Classified Instances 260 26 %
Kappa statistic 0.5674
Mean absolute error 0.2554
Root mean squared error 0.3552
Relative absolute error 60.623 %
Root relative squared error 77.4053 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.479 0.083 0.723 0.479 0.576 0.795 I
0.941 0.352 0.707 0.941 0.808 0.894 E
0.673 0.023 0.889 0.673 0.766 0.964 R
Weighted Avg. 0.74 0.198 0.751 0.74 0.727 0.878
=== Confusion Matrix ===
a b c <-- classified as
149 148 14 | a = I
24 447 4 | b = E
33 37 144 | c = R
74%, it's something...
But now, if I try with my test set of 200 instances...
=== Run information ===
Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation: testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 1000
Attributes: 276
[list of attributes omitted]
Test mode:user supplied test set: size unknown (reading incrementally)
=== Classifier model (full training set) ===
Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269
Time taken to build model: 4.72 seconds
=== Evaluation on test set ===
=== Summary ===
Correctly Classified Instances 86 43 %
Incorrectly Classified Instances 114 57 %
Kappa statistic 0.2061
Mean absolute error 0.3829
Root mean squared error 0.4868
Relative absolute error 84.8628 %
Root relative squared error 99.2642 %
Total Number of Instances 200
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.17 0.071 0.652 0.17 0.27 0.596 I
0.941 0.711 0.312 0.941 0.468 0.796 E
0.377 0 1 0.377 0.548 0.958 R
Weighted Avg. 0.43 0.213 0.671 0.43 0.405 0.758
=== Confusion Matrix ===
a b c <-- classified as
15 73 0 | a = I
3 48 0 | b = E
5 33 23 | c = R
43% ... obviously something is really wrong. I used batch filtering for the test set, along these lines (a minimal sketch with Weka's Java API; the file names and class index are placeholders, and the StringToWordVector options shown in the run info above are left out for brevity):
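import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // placeholder paths
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);    // adjust to wherever the class attribute really is
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setInputFormat(train);                         // the dictionary is built from the training data only
        Instances trainVec = Filter.useFilter(train, s2wv);
        Instances testVec = Filter.useFilter(test, s2wv);   // the same dictionary is applied to the test data,
                                                            // so both sets end up with identical attributes
    }
}

As far as I understand, the supervised AttributeSelection filter should then be applied in the same batch way afterwards (or both steps wrapped in a MultiFilter), and the command-line equivalent is the filter's batch mode: -b with -i/-o for the training files and -r/-s for the test files.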
What am I doing wrong? I classified the training and test sets manually using the same criteria, so I find these differences strange.
I think I got the concept behind CV, but maybe I'm wrong.
Thanks
Following up on your comment, these are the statistics:
CV error: 26%
Test error: 57%
Training error: 1.2%
A low training error is always fishy. The first thing to do when you get a very low training error is to check the CV or test error. If there is a big gap between the training error and the CV/test error, the model is probably overfitting. This is a very good sanity check, and you can then use other methods, such as learning curves, to confirm whether your model overfits. Overfitting means that your model cannot generalize well: it has essentially memorized the training data and fits only that data, so when you apply it to unseen data it performs poorly.
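If you want to see all three numbers side by side, here is a minimal sketch with Weka's Java API (the file paths and the class index are assumptions; plug in your own batch-filtered data):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OverfitCheck {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train_filtered.arff"); // placeholder paths
        Instances test = DataSource.read("test_filtered.arff");
        train.setClassIndex(train.numAttributes() - 1);            // adjust if the class is elsewhere
        test.setClassIndex(test.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        // Training error: the model is evaluated on the same data it was built from.
        Evaluation trainEval = new Evaluation(train);
        trainEval.evaluateModel(rf, train);

        // Cross-validation error: a fresh model is built and tested on each of the 10 folds.
        Evaluation cvEval = new Evaluation(train);
        cvEval.crossValidateModel(new RandomForest(), train, 10, new Random(1));

        // Test error: the trained model is evaluated on the held-out set.
        Evaluation testEval = new Evaluation(train);
        testEval.evaluateModel(rf, test);

        System.out.printf("train %.1f%%  CV %.1f%%  test %.1f%%%n",
                100 * trainEval.errorRate(),
                100 * cvEval.errorRate(),
                100 * testEval.errorRate());
    }
}

A training error close to zero combined with much higher CV and test errors is exactly the gap described above.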
Coming back to your problem: CV and the test set can be seen as two independent checks of the generalization capacity of your model. A model that overfits will not generalize well, so CV gives you one result while the test set gives you a different one.
Hope that helps!