I would like to perform multidimensional scaling (MDS) on a PySpark DataFrame. I know how to solve the problem with pandas + scikit-learn, but I am struggling to do the same with a Spark DataFrame. Here is the pandas-based solution:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import manifold

# Collect the Spark DataFrame to the driver
input_pandas_df = spark_df.toPandas()
# Pairwise Euclidean distances between all rows
distances = euclidean_distances(input_pandas_df)
# 'precomputed' tells MDS the input is already a distance matrix
scaled_distance_matrix = manifold.MDS(n_components=2, dissimilarity='precomputed').fit_transform(distances)
For the first part, computing the pairwise distance matrix, I have an idea of a MapReduce-style algorithm, but I don't know how to implement it in Spark:
mapper(r_i):
    for all pairs (a_ij, a_ik) in r_i do
        emit ((c_j, c_k) -> a_ij * a_ik)
    end for
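Here is a rough sketch of how I imagine this could translate to PySpark RDD operations. Rather than emitting per-column products as in the pseudocode above, it computes the full distance for each row pair directly, since that is what MDS needs (this assumes spark_df contains only numeric columns; zipWithIndex and cartesian are the building blocks I would try, and the names indexed and pairwise are my own):

import math

# Attach a row index i to every row so pairs can be keyed by (i, j)
indexed = spark_df.rdd.zipWithIndex().map(lambda x: (x[1], list(x[0])))

# Cartesian product of the indexed rows: for every (row_i, row_j) pair,
# emit ((i, j), euclidean_distance(row_i, row_j))
pairwise = indexed.cartesian(indexed).map(
    lambda p: ((p[0][0], p[1][0]),
               math.sqrt(sum((a - b) ** 2 for a, b in zip(p[0][1], p[1][1]))))
)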
The second problem is how to apply manifold.MDS to the resulting distance matrix.
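Since I am not aware of a distributed MDS implementation in Spark MLlib, my current thinking is to collect the pairwise distances to the driver and apply scikit-learn there (assuming the n x n matrix fits in driver memory; pairwise comes from the sketch above):

import numpy as np
from sklearn import manifold

# Assemble the collected ((i, j), distance) pairs into a dense matrix
n = spark_df.count()
dist = np.zeros((n, n))
for (i, j), d in pairwise.collect():
    dist[i, j] = d

# 'precomputed' because dist is already a distance matrix
embedding = manifold.MDS(n_components=2, dissimilarity='precomputed').fit_transform(dist)

But that only moves the bottleneck back to the driver, so I would also welcome a fully distributed approach.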