How to do text classification with Accord.Net?

3.7k views Asked by At

I'm trying to use accord.net for text classifaction. But I can't find a way to represent sparse vectors and matrices. For example, we have a lot of texts and after tokenization with ngrams and hashing, every text represented as a feature index(given by featureHasher) with a weight(tf). And it is impossible to load all data as a non sparse matrix into a memory. Is there a way to do incremental processing or represent sparse matrix or do feature reduction with sparse data?

1

There are 1 answers

4
Cesar On BEST ANSWER

Unfortunately, not all models and methods support sparse matrices at this time. However, if you are trying to do text categorization, you might be able to do it using a Support Vector Machine with a Sparse kernel.

Sparse kernels can be found in the Accord.Statistics.Kernels.Sparse namespace, such as for example the SparseLinear and SparseGaussian. Those kernels expect data to be given in LibSVM's Sparse format. The specification for this format can be found in LibSVM's FAQ under the question Why sometimes not all attributes of a data appear in the training/model files?.

Basically, in this format, a feature vector that would be represented as

1 0 2 0

is represented as

1:1 3:2

or in other words, as a list of position:value pairs, where position starts at 1.

Here is an example on how to use SVMs with a SparseLinear kernel using LibSVM's sparse linear format:

// Example AND problem
double[][] inputs =
{
    new double[] {          }, // 0 and 0: 0 (label -1)
    new double[] {      2,1 }, // 0 and 1: 0 (label -1)
    new double[] { 1,1      }, // 1 and 0: 0 (label -1)
    new double[] { 1,1, 2,1 }  // 1 and 1: 1 (label +1)
};

// Dichotomy SVM outputs should be given as [-1;+1]
int[] labels =
{
    // 0,  0,  0, 1
        -1, -1, -1, 1
};

// Create a Support Vector Machine for the given inputs
// (sparse machines should use 0 as the number of inputs)
var machine = new KernelSupportVectorMachine(new SparseLinear(), inputs: 0); 

// Instantiate a new learning algorithm for SVMs
var smo = new SequentialMinimalOptimization(machine, inputs, labels);

// Set up the learning algorithm
smo.Complexity = 100000.0;

// Run
double error = smo.Run(); // should be zero

double[] predicted = inputs.Apply(machine.Compute).Sign();

// Outputs should be -1, -1, -1, +1
Assert.AreEqual(-1, predicted[0]);
Assert.AreEqual(-1, predicted[1]);
Assert.AreEqual(-1, predicted[2]);
Assert.AreEqual(+1, predicted[3]);