I'm trying to get Weka to predict from the command line, but I'm concerned I might be doing this wrong. I read the Data Mining book and searched their site for documentation, yet what I found was vague at best, so I hope you can help me.
First, I created a training set (train.arff). Here's a sample:
@relation test
@attribute 'A' {0,1}
@attribute 'B' {0,1}
@attribute 'C' {0,1}
@attribute 'D' {0,1}
@attribute 'E' {0,1}
@attribute 'F' {0,1}
@data
0,0,0,0,0,0
0,0,0,0,0,0
...
Then I created the data set to be completed by prediction (test.arff):
@relation test
@attribute 'A' {0,1}
@attribute 'B' {0,1}
@attribute 'C' {0,1}
@attribute 'D' {0,1}
@attribute 'E' {0,1}
@attribute 'F' {0,1}
@data
0,?,0,0,0,0
0,?,0,0,0,0
...
The "?" marks the attribute that should be predicted.
Finally, I attempted to get the predictions by running this on the command line:
java weka.classifiers.trees.J48 -t train.arff -T test.arff -p 0
It produces the following output:
=== Predictions on test data ===
inst# actual predicted error prediction
1 2:1 2:1 0.939
2 2:1 2:1 0.939
For each row (identified by inst#), I then took the number after the ":" in the predicted column as the prediction.
Here are my questions:
Is this correct? I'm concerned about "?" as I read that it may be imputed (although that may be only during the learning phase).
Does Weka support multiple predictions? No matter how many fields are marked with "?" I always get the same table with only one predicted value per instance.
Can Weka generate a complete (predicted) ARFF file, or do I have to construct this myself from its results?
If I missed something glaringly obvious, apologies in advance. Any pointers to relevant documentation would be greatly appreciated.
Thanks in advance!
The '?' is a generic marker for an unknown value. It can be used in training and test data and tells Weka that in this particular case, the value is not available. What is then done with that information depends on the actual learning algorithm. So to answer your questions:
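Weka predicts one class attribute at a time, which you can select with the -c argument. This argument gives the index of the attribute to predict. By default, it's the last one, so 'F' in your case; adding -c 2 to your command would select 'B' instead.
Note that you can save a trained model and then use it to make predictions. The documentation page on making predictions also contains the knowledge flow you can construct to save the results of this as an ARFF file.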
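If you would rather build the completed rows yourself from the -p output, here is a minimal sketch in plain Java (no Weka classes involved). It assumes the run used -c 2, so the predictions refer to 'B', and that the data rows are simple comma-separated values without quoted commas. The class name FillPredictions, both method names, and the sample output text (shaped like the table in the question) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FillPredictions {

    // Pulls the label after ':' out of the "predicted" column of each row.
    static List<String> parsePredictedLabels(String wekaOutput) {
        List<String> labels = new ArrayList<>();
        for (String line : wekaOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Prediction rows start with the instance number; headers do not.
            if (cols.length < 3 || !cols[0].matches("\\d+")) {
                continue;
            }
            String predicted = cols[2]; // e.g. "2:1" = value index 2, label "1"
            int colon = predicted.indexOf(':');
            if (colon >= 0) {
                labels.add(predicted.substring(colon + 1));
            }
        }
        return labels;
    }

    // Writes each label into column classIndex (1-based, like -c) of its row.
    // Assumption: plain comma-separated rows with no quoted commas.
    static List<String> fillRows(List<String> rows, List<String> labels, int classIndex) {
        if (rows.size() != labels.size()) {
            throw new IllegalArgumentException("row/prediction count mismatch");
        }
        List<String> filled = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            String[] values = rows.get(i).split(",", -1);
            values[classIndex - 1] = labels.get(i);
            filled.add(String.join(",", values));
        }
        return filled;
    }

    public static void main(String[] args) {
        // Hypothetical output for illustration, not captured from a real run.
        String output = String.join("\n",
            "=== Predictions on test data ===",
            "inst# actual predicted error prediction",
            "1 1:? 2:1 0.939",
            "2 1:? 1:0 0.875");
        List<String> rows = Arrays.asList("0,?,0,0,0,0", "0,?,0,0,0,0");
        for (String row : fillRows(rows, parsePredictedLabels(output), 2)) {
            System.out.println(row);
        }
    }
}
```

The filled rows can then be written back under the original @relation and @attribute header to get a complete ARFF file. The index passed to fillRows has to match the class index used for the prediction run, otherwise the labels end up in the wrong column.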