I want to make libsvm format, so I made dataframe to the desired format, but I do not know how to convert to libsvm format. The format is as shown in the figure. I hope that the desired libsvm type is user item:rating . If you know what to do in the current situation :
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
val fields = line.split(",")
(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
val user = ratings.map{ case (user,product,rate) => (user,(product.toInt,rate.toDouble))}
val usergroup = user.groupByKey
val data =usergroup.map{ case(x,iter) => (x,iter.map(_._1).toArray,iter.map(_._2).toArray)}
val data_DF = data.toDF("user","item","rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following :
LabeledPoint
data X.1. Converting your ratings into
LabeledPoint
data XLet's consider the following raw ratings :
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries :
CoordinateMatrix
where each entry is a tuple of (i: Long, j: Long, value: Double).Note : A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse. (which is usually the case of user/item ratings.)
Now let's convert that
RDD[MatrixEntry]
to aCoordinateMatrix
and extract the indexed rows :2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, You can do that using the
DataFrameWriter
. Let's create a small example with some dummy LabeledPoint data (you can also use theDataFrame
we created earlier) :Unfortunately we still can't use the
DataFrameWriter
directly because while most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines in Spark versions prior to 2.0, that contain vector or matrix columns, may need to be migrated to the new spark.ml vector and matrix types.Utilities for converting DataFrame columns from
mllib.linalg
toml.linalg
types (and vice versa) can be found inorg.apache.spark.mllib.util.MLUtils.
In our case we need to do the following (for both the dummy data and theDataFrame
fromstep 1.
)Now let's save the DataFrame :
And we can check the files contents :
EDIT: In current version of spark (2.1.0) there is no need to use
mllib
package. You can simply saveLabeledPoint
data in libsvm format like below: