I am trying to build a linear SVM classifier to classify unknown test data.
However, as text documents do not have a fixed length, how do I ensure that the new documents have the same feature length?
This is the error I get:
Src and Dest differ in # of attributes: 2 != 1484
LibSVM classifier = new LibSVM();
classifier.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_LINEAR, LibSVM.TAGS_KERNELTYPE));
classifier.buildClassifier(data1);
System.out.println("done");
data2.setClassIndex(data2.numAttributes() - 1);
double res = classifier.classifyInstance(data2.instance(0));
The data2 ARFF file:
@data
'This is a string!','?'
Is there any way I could build a feature vector with the same number of attributes as the current model expects? Or is there some other solution?
I doubt that this will work, because SVMs can only handle numeric data. If you want to use strings, you either have to use another kernel, or convert your string data into numerical data using a filter.
I suggest you try the StringToWordVector filter:
As the description of that filter says: you batch filter the training data first, which initializes the filter. If you then apply the filter to your test data (even new, unknown data), the result will always be compatible with your filtered training data.
The big question is whether your model has to survive the termination of your program. If not, no problem.
(source)
Since your filter has been initialized on your training data, you can now apply it to any data set that looks like the unfiltered training set by repeating the last line there.
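To illustrate why this works, here is a small library-free sketch of the idea in plain Java. It is not Weka's StringToWordVector implementation, and the names WordVectorSketch, fit, transform and numAttributes are made up for this example. The vocabulary is collected once from the training documents, so every later document, however long, is mapped onto the same fixed set of attributes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified word-vector filter: the vocabulary is fixed by the training batch. */
class WordVectorSketch {
    // word -> column index; only fit() adds entries
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** "Initializes the filter": collects the vocabulary from the training documents only. */
    void fit(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
    }

    /** Maps any document to a count vector of length numAttributes(); unseen words are ignored. */
    double[] transform(String doc) {
        double[] vector = new double[vocabulary.size()];
        for (String token : tokenize(doc)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    int numAttributes() {
        return vocabulary.size();
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String token : doc.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        WordVectorSketch filter = new WordVectorSketch();
        filter.fit(Arrays.asList("the cat sat", "the dog barked loudly"));

        // A new document of a different length still gets numAttributes() columns.
        double[] unseen = filter.transform("This is a string! The cat");
        System.out.println(filter.numAttributes() + " attributes, vector length " + unseen.length);
        System.out.println(Arrays.toString(unseen));
    }
}
```

The point to take away is that transform never touches the vocabulary, which is why the attribute count of the test data cannot drift away from what the model was trained on.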
If you want to save your model and apply it over and over during multiple runs of your application, you should use the FilteredClassifier. (Have a look at this answer, where I explained the use of FilteredClassifier.) tl;dr: The filter is part of the classifier and can be serialized along with it, preserving the transformation on the input data.
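To make the tl;dr concrete, here is a rough plain-Java sketch of the bundling idea. It is not Weka code: FilteredModelSketch and its methods are hypothetical, and the word-count scorer is only a toy stand-in for an SVM. Because the vocabulary (the "filter") and the weights (the "classifier") live in one serializable object, a restored model can score raw strings directly:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bundle of a text-to-vector step and a toy linear scorer. */
class FilteredModelSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final Map<String, Integer> vocabulary = new LinkedHashMap<>(); // "filter" state
    private double[] weights = new double[0];                               // "classifier" state

    /** Learns the vocabulary and one weight per word from labels of +1 or -1. */
    void build(List<String> docs, List<Integer> labels) {
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        weights = new double[vocabulary.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (String token : tokenize(docs.get(i))) {
                weights[vocabulary.get(token)] += labels.get(i);
            }
        }
    }

    /** Scores a raw string; the stored vocabulary decides which words count. */
    double score(String rawDoc) {
        double sum = 0.0;
        for (String token : tokenize(rawDoc)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                sum += weights[index];
            }
        }
        return sum;
    }

    /** Serializes vocabulary and weights together, as one object. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores a model that is immediately usable on raw text. */
    static FilteredModelSketch fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (FilteredModelSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String[] tokenize(String doc) {
        // Strip leading separators first so split() yields no empty leading token.
        String cleaned = doc.toLowerCase().replaceAll("^[^a-z0-9]+", "");
        return cleaned.isEmpty() ? new String[0] : cleaned.split("[^a-z0-9]+");
    }

    public static void main(String[] args) {
        FilteredModelSketch model = new FilteredModelSketch();
        model.build(List.of("good great film", "bad awful film"), List.of(1, -1));

        // Simulates saving in one run and loading in another.
        FilteredModelSketch restored = fromBytes(model.toBytes());
        System.out.println(restored.score("A great, good movie"));
    }
}
```

If the vocabulary were kept outside the serialized object, the restored weights would have nothing to map new words onto, which is the situation the attribute-count error in the question comes from.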