pyspark multidimensional scaling


I would like to perform multidimensional scaling (MDS) on a PySpark DataFrame. I know how to solve my problem using pandas + sklearn, but I am struggling to do the same with a Spark DataFrame. Here is the pandas-based solution:

from sklearn.metrics.pairwise import euclidean_distances
from sklearn import manifold

input_pandas_df = spark_df.toPandas()
distances = euclidean_distances(input_pandas_df)
scaled_distance_matrix = manifold.MDS(n_components=2).fit_transform(distances)

For the first part of the code above (the pairwise distances), I have an idea for an algorithm, but I don't know how to implement it:

mapper(r_i):
    for all pairs (a_ij, a_ik) in r_i do
        emit ((c_j, c_k) -> a_ij * a_ik)
    end for
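
To make the pseudocode above concrete, here is a minimal pure-Python sketch of the same map step followed by a sum per key. It is only an illustration under assumptions: `mapper` and `reduce_by_key` are hypothetical helper names, plain lists stand in for the rows of the DataFrame, and the reduce step is assumed to be a sum, which the pseudocode does not state. Note that summing the emitted products gives the column-by-column products (the entries of AᵀA), not row-to-row distances. If the Gram matrix G of the rows is wanted instead, the same idea applies to the transposed data, and squared Euclidean distances follow from d_ij² = G_ii + G_jj - 2·G_ij.

```python
from collections import defaultdict
from itertools import product


def mapper(row):
    """Emit ((j, k), a_ij * a_ik) for every ordered pair of entries in one row."""
    for (j, a_ij), (k, a_ik) in product(enumerate(row), repeat=2):
        yield (j, k), a_ij * a_ik


def reduce_by_key(pairs):
    """Sum the emitted values per (j, k) key (assumed reduce step)."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Toy input: two rows, two columns.
rows = [[1.0, 2.0], [3.0, 4.0]]

# Map every row, then sum per column pair; the result holds the entries of A^T A.
gram = reduce_by_key(pair for row in rows for pair in mapper(row))
```

On a real cluster the same two steps would roughly correspond to a flat-map over rows followed by a per-key sum, but that mapping is not shown or tested here.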

The second problem is how to apply manifold.MDS to the resulting distance matrix.
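
For reference, one way to think about the second step is classical (Torgerson) MDS, which needs only a distance matrix and an eigendecomposition. The sketch below is an assumption-laden illustration, not a drop-in replacement: `classical_mds` is a hypothetical helper, it uses NumPy rather than Spark, it assumes the n x n distance matrix fits in memory on one machine, and it is a different algorithm from the iterative SMACOF procedure behind `manifold.MDS`, so the embeddings will generally differ.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Classical (Torgerson) MDS from a square matrix of pairwise distances."""
    d2 = np.asarray(distances, dtype=float) ** 2
    n = d2.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering
    # B is symmetric, so eigh applies; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy check: three points in the plane, embedded back into two dimensions.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = classical_mds(distances, n_components=2)

# Pairwise distances of the embedding, for comparison with the input.
recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
```

The embedding is only determined up to rotation and reflection, so the coordinates themselves need not match the original points even when the pairwise distances do.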
