Does MATLAB support the parallelization of supervised machine learning algorithms? Alternatives?

Up to now I have used RapidMiner for some data/text mining tasks, but as the amount of data grows I am running into serious performance problems. AFAIK the RapidMiner Parallel Processing Extension is only available for the enterprise version; unfortunately I am limited to the community version.

Now I want to move the tasks to a high-performance cluster using MATLAB (academic license). I could not find any information on whether the Parallel Computing Toolbox supports, e.g., SVM or KNN.

Do MATLAB or any additional libraries support the parallelization of data mining algorithms?

There is 1 answer

Answer by Sam Roberts:

Most data mining and machine learning functionality for MATLAB is contained within Statistics Toolbox (in recent versions, that's called Statistics and Machine Learning Toolbox). To enable parallelization, you'll also need Parallel Computing Toolbox, and to enable that parallelization to be carried out on an HPC cluster, you'll need to install MATLAB Distributed Computing Server on the cluster.
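As a rough sketch of how those pieces fit together from the MATLAB side: this assumes Parallel Computing Toolbox is installed, and it uses the default cluster profile as a stand-in for a profile configured for the HPC cluster. The pool size of 2 is arbitrary.

```matlab
% Sketch only: requires Parallel Computing Toolbox. With no arguments,
% parcluster returns the default cluster profile (the local machine unless
% that default has been changed). For an HPC cluster, pass the name of a
% profile that has been configured for it instead.
c = parcluster();

delete(gcp('nocreate'));                 % close any pool that is already open
nWorkers = min(2, c.NumWorkers);         % arbitrary small pool for the example
pool = parpool(c, nWorkers);             % start workers on that cluster

fprintf('Pool with %d workers on profile "%s"\n', pool.NumWorkers, c.Profile);
```

Once a pool is open, toolbox functions that support parallel execution, and your own parfor loops, run on its workers.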

There are lots of ways that you might want to parallelize data mining tasks - for example, you might want to parallelize an individual learning task, or parallelize a cross-validation, or parallelize several learning tasks across multiple datasets.

The first is possible for some, but not all, of the data mining algorithms in Statistics Toolbox. MathWorks is adding parallel support piece by piece. For example, kmeans is parallelized, and there is a parallelized algorithm for bagged decision trees, but I believe SVM learning is currently not parallelized. You'll need to check the Statistics Toolbox documentation to find out whether the algorithms you require are on the list.
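For the algorithms that do have parallel support, it is typically switched on through an options structure. A minimal sketch, assuming Statistics Toolbox and Parallel Computing Toolbox are both available; the data is synthetic and the variable names are made up for illustration:

```matlab
% Sketch only: synthetic data, two well-separated Gaussian blobs.
rng(0);                                  % reproducible example data
X = [randn(500, 2); randn(500, 2) + 4];  % 1000 observations, 2 features
Y = [ones(500, 1); 2 * ones(500, 1)];    % class labels for the two blobs

if isempty(gcp('nocreate'))
    parpool;                             % open a pool on the default profile
end

opts = statset('UseParallel', true);     % ask toolbox functions to use the pool

% kmeans can run its replicates (restarts) on different workers.
idx = kmeans(X, 2, 'Replicates', 8, 'Options', opts);

% TreeBagger can grow the bagged trees in parallel.
forest = TreeBagger(50, X, Y, 'Method', 'classification', 'Options', opts);

% For classification, predict returns a cell array of label strings.
pred = str2double(predict(forest, X));
fprintf('Training accuracy of the bagged trees: %.3f\n', mean(pred == Y));
```

Whether a given function accepts the UseParallel option depends on the function and on your release, so check its documentation page first.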

The other two are possible. Functionality in Statistics Toolbox for cross-validation (and bootstrapping, jack-knifing) is parallelized, as are some feature selection algorithms. And to parallelize running several jobs over multiple datasets, you can use functionality from Parallel Computing Toolbox, such as a parfor (parallel for) loop, to iterate over them.
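Both approaches can be sketched roughly as follows. This assumes a release that includes fitcsvm (R2014a or later), and it uses synthetic data and illustrative variable names. The SVM training itself stays serial; only the folds and the datasets are spread across workers.

```matlab
% Sketch only: synthetic two-class data.
rng(1);
X = [randn(200, 2); randn(200, 2) + 3];
Y = [zeros(200, 1); ones(200, 1)];

if isempty(gcp('nocreate'))
    parpool;                             % open a pool on the default profile
end

% (a) Parallel cross-validation: each fold can be evaluated on a worker.
opts    = statset('UseParallel', true);
predfun = @(Xtr, Ytr, Xte) predict(fitcsvm(Xtr, Ytr), Xte);
cvErr   = crossval('mcr', X, Y, 'Predfun', predfun, 'KFold', 5, 'Options', opts);
fprintf('5-fold misclassification rate: %.3f\n', cvErr);

% (b) Several independent learning tasks: one dataset per parfor iteration.
nSets    = 4;
datasets = cell(1, nSets);
for k = 1:nSets
    shift = 2 + k;                       % a different class separation per dataset
    datasets{k} = struct('X', [randn(100, 2); randn(100, 2) + shift], ...
                         'Y', [zeros(100, 1); ones(100, 1)]);
end

trainErr = zeros(1, nSets);
parfor k = 1:nSets
    D = datasets{k};
    model = fitcsvm(D.X, D.Y);           % ordinary serial SVM training
    trainErr(k) = resubLoss(model);      % training-set misclassification rate
end
disp(trainErr);
```

The parfor pattern works for any learner, parallelized or not, because each iteration is an independent job.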

In addition, the upcoming R2015b release of MATLAB (out in September) will include GPU-enabled statistics functionality, providing additional speedups.