How to do distributed Principal Components Analysis + Kmeans using Apache Spark?

853 views Asked by At

I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.

I know that Spark supports PCA and also PCA + Kmeans.

However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.

0

There are 0 answers