Big accuracy difference between cross-validation and a separate test set in Weka: is it normal?


I'm new to Weka and I have a problem with my classification project.

I have a training dataset with 1000 instances and a test set with 200. The problem is that when I evaluate the performance of some algorithms (like RandomForest), the accuracy reported by cross-validation and by the supplied test set is very different.

Here is an example with cross-validation

=== Run information ===

Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation:     testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances:    1000
Attributes:   276
[list of attributes omitted]
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269



Time taken to build model: 4.9 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         740               74      %
Incorrectly Classified Instances       260               26      %
Kappa statistic                          0.5674
Mean absolute error                      0.2554
Root mean squared error                  0.3552
Relative absolute error                 60.623  %
Root relative squared error             77.4053 %
Total Number of Instances             1000     

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.479     0.083      0.723     0.479     0.576      0.795    I
                 0.941     0.352      0.707     0.941     0.808      0.894    E
                 0.673     0.023      0.889     0.673     0.766      0.964    R
Weighted Avg.    0.74      0.198      0.751     0.74      0.727      0.878

=== Confusion Matrix ===

   a   b   c   <-- classified as
 149 148  14 |   a = I
  24 447   4 |   b = E
  33  37 144 |   c = R

74%, it's something...

But now if I try with my test set of 200 instances...

=== Run information ===

Scheme:weka.classifiers.trees.RandomForest -I 100 -K 0 -S 1
Relation:     testData-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10000000-prune-rate-1.0-T-I-N0-L-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\"\'()?!--+-í+*&#$\\/=<>[]_`@"-weka.filters.supervised.attribute.AttributeSelection-Eweka.attributeSelection.InfoGainAttributeEval-Sweka.attributeSelection.Ranker -T 0.0 -N -1
Instances:    1000
Attributes:   276
[list of attributes omitted]
Test mode:user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

Random forest of 100 trees, each constructed while considering 9 random features.
Out of bag error: 0.269



Time taken to build model: 4.72 seconds

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances          86               43      %
Incorrectly Classified Instances       114               57      %
Kappa statistic                          0.2061
Mean absolute error                      0.3829
Root mean squared error                  0.4868
Relative absolute error                 84.8628 %
Root relative squared error             99.2642 %
Total Number of Instances              200     

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.17      0.071      0.652     0.17      0.27       0.596    I
                 0.941     0.711      0.312     0.941     0.468      0.796    E
                 0.377     0          1         0.377     0.548      0.958    R
Weighted Avg.    0.43      0.213      0.671     0.43      0.405      0.758

=== Confusion Matrix ===

  a  b  c   <-- classified as
 15 73  0 |  a = I
  3 48  0 |  b = E
  5 33 23 |  c = R

43%... obviously something is really wrong. I did use batch filtering for the test set.

What am I doing wrong? I labelled the test and training sets manually using the same criteria, so I find these differences strange.
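For reference, here is roughly the batch-filtering pattern I am using, as a minimal sketch against the Weka Java API (train.arff, test.arff and the class index are placeholders for my actual setup). The idea is that the StringToWordVector filter is built from the training data only, and the same filter instance is then reused on the test set, so both sets end up in the same attribute space:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterEval {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // placeholder paths
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);  // assumes class is the last attribute
        test.setClassIndex(test.numAttributes() - 1);

        // Batch filtering: learn the vocabulary from the training data only,
        // then push the test data through the SAME filter instance.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec  = Filter.useFilter(test, filter);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(trainVec);

        Evaluation eval = new Evaluation(trainVec);
        eval.evaluateModel(rf, testVec);
        System.out.println(eval.toSummaryString());
    }
}

As far as I understand, weka.classifiers.meta.FilteredClassifier achieves the equivalent by wrapping the filter and the classifier together.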

I think I got the concept behind CV, but maybe I'm wrong.

Thanks

1 Answer

Answered by Rushdi Shams (accepted):

From your comment, these are the statistics:

CV error: 26%

Test error: 57%

Training error: 1.2%

A low training error is always fishy. If you get a very low training error, the first thing to do is check the cross-validation or test error. If there is a big difference between the training error and the test/CV error, there is a good chance the model is overfitting. This is a very good sanity check, and you can then use other methods, such as learning curves, to confirm whether your model overfits. Overfitting means that your model cannot generalize well: it has essentially memorized the training data, so when you apply it to unseen data it performs poorly.
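As a minimal sketch of this sanity check with the Weka Java API (train.arff is a placeholder file name): build the model, measure its error on the very data it was trained on, and compare that with the 10-fold cross-validation error.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OverfitCheck {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        // Error on the same data the model was trained on
        Evaluation trainEval = new Evaluation(train);
        trainEval.evaluateModel(rf, train);

        // 10-fold cross-validation error (trains a fresh classifier per fold)
        Evaluation cvEval = new Evaluation(train);
        cvEval.crossValidateModel(new RandomForest(), train, 10, new Random(1));

        System.out.printf("Training error: %.1f%%%n", trainEval.errorRate() * 100);
        System.out.printf("CV error:       %.1f%%%n", cvEval.errorRate() * 100);
        // A large gap (like 1.2% vs 26% in your case) points to overfitting.
    }
}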

So, coming back to your problem: cross-validation and the test set can be seen as two independent tests of the generalization capacity of your model. A model that overfits cannot generalize well, so cross-validation gives you one result while the test set gives you a quite different one.
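And for the learning curves mentioned above, a minimal sketch (again, the file names and step size are placeholders, and it assumes train and test are already in the same attribute space): train on increasingly large portions of the training data and watch whether the gap between training error and test error closes.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearningCurve {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // placeholder paths
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        train.randomize(new Random(1)); // shuffle before taking prefixes

        for (int n = 100; n <= train.numInstances(); n += 100) {
            Instances subset = new Instances(train, 0, n);

            RandomForest rf = new RandomForest();
            rf.buildClassifier(subset);

            Evaluation onTrain = new Evaluation(subset);
            onTrain.evaluateModel(rf, subset);
            Evaluation onTest = new Evaluation(subset);
            onTest.evaluateModel(rf, test);

            // An overfitting model keeps a large gap between these two numbers
            System.out.printf("n=%4d  train error=%5.1f%%  test error=%5.1f%%%n",
                    n, onTrain.errorRate() * 100, onTest.errorRate() * 100);
        }
    }
}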

Hope that helps.