I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames, and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function, pinv, for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Can Spark and the ScalaNLP library Breeze be used together?
1.3k views Asked by LucieCBurgess At
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.

For example, for Vectors you can find the conversion code in:
- SparseVector.asBreeze
- DenseVector.asBreeze
- Vector.fromBreeze

For Matrices, please see asBreeze / fromBreeze in Matrices.scala.

It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries, which cannot be used for distributed processing. Therefore, conversions between a DataFrame and Breeze objects are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.

There are other libraries, such as SystemML, that integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
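As a rough illustration of the "add your own conversions" approach, here is a minimal sketch that converts local Spark vectors and matrices to Breeze and back, and computes a pseudo-inverse with Breeze's pinv. It assumes Spark 2.x or later (the org.apache.spark.ml.linalg types) and a Breeze version compatible with the one Spark ships. The object name BreezeInterop and the helpers toBreeze, toSpark, pseudoInverse and collectToBreeze are hypothetical names made up for this example; they are not part of Spark or Breeze.

```scala
import breeze.linalg.{pinv, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector, Vector}
import org.apache.spark.sql.DataFrame

// Hypothetical helper object: these names are illustrative only.
object BreezeInterop {

  // Spark dense vector -> Breeze dense vector (copies the values).
  def toBreeze(v: DenseVector): BDV[Double] =
    new BDV[Double](v.values.clone())

  // Breeze dense vector -> Spark dense vector.
  def toSpark(v: BDV[Double]): DenseVector =
    new DenseVector(v.toArray)

  // toArray on both sides yields a column-major copy,
  // so the layouts line up without any reordering.
  def toBreeze(m: DenseMatrix): BDM[Double] =
    new BDM[Double](m.numRows, m.numCols, m.toArray)

  def toSpark(m: BDM[Double]): DenseMatrix =
    new DenseMatrix(m.rows, m.cols, m.toArray)

  // Moore-Penrose pseudo-inverse of a local Spark matrix, computed by Breeze.
  def pseudoInverse(m: DenseMatrix): DenseMatrix =
    toSpark(pinv(toBreeze(m)))

  // Collects one vector column to the driver and packs it into a Breeze
  // matrix (one row per record). Only safe when the data fits in driver memory.
  def collectToBreeze(df: DataFrame, featuresCol: String): BDM[Double] = {
    val rows = df.select(featuresCol).collect().map(_.getAs[Vector](0).toArray)
    val numCols = if (rows.isEmpty) 0 else rows.head.length
    val out = BDM.zeros[Double](rows.length, numCols)
    for (i <- rows.indices; j <- 0 until numCols) out(i, j) = rows(i)(j)
    out
  }
}
```

Everything here runs on the driver. collectToBreeze pulls the whole column into local memory, so it is only an option when the matrix is small enough; for larger data, a distributed linear algebra library would be needed instead.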