Differing number of features in test and training set


I am trying to build a linear SVM classifier to classify unknown test data.

However, since text documents do not have a fixed length, how do I ensure that new documents end up with the same number of features as the training data?

At the moment I get this error:

Src and Dest differ in # of attributes: 2 != 1484

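import weka.classifiers.functions.LibSVM;
import weka.core.SelectedTag;

// data1: training Instances; data2: Instances loaded from the ARFF shown below
LibSVM classifier = new LibSVM();
classifier.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_LINEAR, LibSVM.TAGS_KERNELTYPE));
classifier.buildClassifier(data1);
System.out.println("done");

data2.setClassIndex(data2.numAttributes() - 1);
double res = classifier.classifyInstance(data2.instance(0));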

The ARFF data for data2:

@data
'This is a string!','?'

Is there any way I could build a feature vector with the same number of attributes as the current model? Or is there another solution?

1 Answer

Sentry wrote:

I doubt that this will work, because SVMs can only handle numeric data. If you want to use strings, you either have to use another kernel, or convert your string data into numerical data using a filter.

I suggest you try the StringToWordVector filter:

Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).

As the filter's description says, you batch-filter the training data first, which initializes the filter. If you then apply the filter to your test data (even new, unknown data), the result will always be compatible with your filtered training data.

The big question is whether your model has to survive the termination of your program. If not, there is no problem:

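import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
// Standardize is only the example filter here; StringToWordVector is used the same way
Standardize filter = new Standardize();
filter.setInputFormat(train);  // initialize the filter once with the training set
Instances newTrain = Filter.useFilter(train, filter);  // configures the filter from the training instances and returns the filtered instances
Instances newTest = Filter.useFilter(test, filter);    // creates the new test set with the same attributes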

(source)

Since your filter has been initialized on your training data, you can now apply it to any data set that looks like the unfiltered training set by repeating the last line:

Instances newTest2 = Filter.useFilter(test2, filter);    // create another new test set
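The reason a batch-initialized word-vector filter always produces compatible output is that the word list is fixed when the first batch is processed. The following is a minimal plain-Java sketch of that idea, not Weka's actual implementation: the class name VocabularyVectorizer, its methods, and the simple lowercase tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy bag-of-words vectorizer (hypothetical): the vocabulary is fixed by the training batch. */
final class VocabularyVectorizer {
    // word -> column index, in order of first appearance in the training batch
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Learns the vocabulary from the training documents; meant to be called once. */
    void fit(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
    }

    /** Maps any document to a count vector whose length equals the vocabulary size. */
    double[] transform(String doc) {
        double[] vector = new double[vocabulary.size()];
        for (String token : tokenize(doc)) {
            Integer column = vocabulary.get(token);
            if (column != null) {   // words never seen during training are dropped
                vector[column]++;
            }
        }
        return vector;
    }

    int numFeatures() {
        return vocabulary.size();
    }

    // Lowercases the text and splits on runs of non-word characters (an assumed tokenizer).
    private static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String token : doc.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

After fit has seen the training documents, transform returns a vector of length numFeatures() for every input, including a document such as 'This is a string!' whose words may not appear in the training data at all. Dropping unseen words is what keeps the attribute count identical between training and test data.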

If you want to save your model and apply it over and over during multiple runs of your application, you should use the FilteredClassifier. (Have a look at this answer, where I explained the use of FilteredClassifier.) tl;dr: The filter is part of the classifier and can be serialized along with it, preserving the transformation on the input data.
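To illustrate why keeping the transformation inside the serialized model matters, here is a small plain-Java analogy. It is not Weka code and does not show how FilteredClassifier is implemented: TextPipeline, its fields, and the linear scoring rule are hypothetical stand-ins. The point is only that the vocabulary (the "filter" state) and the weights (the "classifier" state) are written and restored together, so raw text can be scored after reloading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Toy stand-in (hypothetical) for "filter + classifier in one serializable object". */
final class TextPipeline implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String[] vocabulary;  // "filter" state learned from the training data
    private final double[] weights;     // "classifier" state, one weight per vocabulary word
    private final double bias;

    TextPipeline(String[] vocabulary, double[] weights, double bias) {
        if (vocabulary.length != weights.length) {
            throw new IllegalArgumentException("one weight per vocabulary word is required");
        }
        this.vocabulary = vocabulary.clone();
        this.weights = weights.clone();
        this.bias = bias;
    }

    /** Counts known words in the raw text and applies the linear model to those counts. */
    double score(String doc) {
        double sum = bias;
        for (String token : doc.toLowerCase().split("\\W+")) {
            for (int i = 0; i < vocabulary.length; i++) {
                if (vocabulary[i].equals(token)) {
                    sum += weights[i];
                }
            }
        }
        return sum;
    }

    /** Serializes the whole pipeline, transformation included, into a byte array. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores a pipeline that was written with toBytes(). */
    static TextPipeline fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (TextPipeline) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A pipeline restored with fromBytes scores raw strings exactly as the original did, without the caller having to rebuild or re-fit the vocabulary. If only the weights were saved, a later run would have no reliable way to reproduce the same feature columns.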