FlinkML 0.10.1 Multiple Linear Regression with Sparse Vectors for Training



I'm trying to test out Flink ML 0.10.1 by doing a linear regression as described here:


I'm using SparseVector instead of DenseVector, but I get this error when I try to train the model:

java.lang.IllegalArgumentException: axpy only supports adding to a dense vector but got type class org.apache.flink.ml.math.SparseVector.
    at org.apache.flink.ml.math.BLAS$.axpy(BLAS.scala:60)
    at org.apache.flink.ml.optimization.GradientDescent$$anonfun$org$apache$flink$ml$optimization$GradientDescent$$SGDStep$2.apply(GradientDescent.scala:181)
    at org.apache.flink.ml.optimization.GradientDescent$$anonfun$org$apache$flink$ml$optimization$GradientDescent$$SGDStep$2.apply(GradientDescent.scala:177)
    at org.apache.flink.api.scala.DataSet$$anon$7.reduce(DataSet.scala:583)
    at org.apache.flink.runtime.operators.chaining.ChainedAllReduceDriver.collect(ChainedAllReduceDriver.java:93)
    at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:97)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:489)
    at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:144)
    at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:92)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
    at java.lang.Thread.run(Thread.java:745)

Does FlinkML MLR not support SparseVectors?


There are 2 answers

Till Rohrmann

The problem is that the GradientDescent implementation expects the sum of gradient vectors to be dense. This is not a strong limitation, because the sum of a set of sparse vectors is not necessarily sparse itself. Furthermore, it is often more efficient to convert the first gradient vector into a dense vector and then add the following sparse gradient vectors to it, instead of adding two sparse vectors each time.
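To illustrate the point about accumulating into a dense vector, here is a minimal, self-contained sketch. It does not use FlinkML: `SparseVec`, `DenseAccumulation`, `axpy` and `sumGradients` are hypothetical names made up for this example, and it only assumes a sparse vector is stored as parallel index and value arrays.

```scala
// Illustrative only: SparseVec is a hypothetical type, not FlinkML's SparseVector.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object DenseAccumulation {
  // y += a * x, touching only the non-zero entries of the sparse x.
  def axpy(a: Double, x: SparseVec, y: Array[Double]): Unit = {
    require(x.size == y.length, "vector sizes must match")
    var i = 0
    while (i < x.indices.length) {
      y(x.indices(i)) += a * x.values(i)
      i += 1
    }
  }

  // Sum sparse gradients into a single dense accumulator that is allocated once.
  def sumGradients(size: Int, gradients: Seq[SparseVec]): Array[Double] = {
    val acc = new Array[Double](size) // initialised to zeros
    gradients.foreach(g => axpy(1.0, g, acc))
    acc
  }

  def main(args: Array[String]): Unit = {
    val g1 = SparseVec(4, Array(0, 2), Array(1.0, 2.0))
    val g2 = SparseVec(4, Array(2, 3), Array(0.5, -1.0))
    // prints [1.0, 0.0, 2.5, -1.0]
    println(sumGradients(4, Seq(g1, g2)).mkString("[", ", ", "]"))
  }
}
```

Each addition costs time proportional to the number of non-zero entries in the sparse operand, and no new index arrays are built, which is the efficiency argument made above.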

I've opened a pull request to fix this issue. It should be merged in the next few days.

Chobeat

I checked the source and that appears to be the case. There's an explicit type check there, and the case where the left vector is sparse raises that error. The code is really ugly, so it's probably just a temporary version that will be improved over time. You should point it out on the mailing list or open an issue on JIRA.
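For readers who have not looked at the source, the kind of type check described here can be sketched roughly as follows. This is not the FlinkML code: `Vec`, `Dense`, `Sparse` and `TypeCheckedAxpy` are hypothetical stand-ins, and only the wording of the error message is taken from the stack trace in the question.

```scala
// Hypothetical stand-ins for a dense and a sparse vector type.
sealed trait Vec { def size: Int }
final case class Dense(data: Array[Double]) extends Vec { def size: Int = data.length }
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object TypeCheckedAxpy {
  // y += a * x, where only a dense y is accepted as the target.
  def axpy(a: Double, x: Vec, y: Vec): Unit = y match {
    case Dense(out) =>
      require(x.size == out.length, "vector sizes must match")
      x match {
        case Dense(in)            => for (i <- in.indices) out(i) += a * in(i)
        case Sparse(_, idx, vals) => for (k <- idx.indices) out(idx(k)) += a * vals(k)
      }
    case other =>
      // A sparse target is rejected, mirroring the exception in the question.
      throw new IllegalArgumentException(
        s"axpy only supports adding to a dense vector but got type ${other.getClass}.")
  }
}
```

With a check like this, any caller that passes a sparse vector as the accumulation target fails at runtime, which matches the exception raised from the gradient descent step in the trace.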