TechQA.

How to do distributed Principal Components Analysis + K-means using Apache Spark?

Asked by Edward J. Stembler on 10 June 2015 at 13:25

I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.

I know that Spark supports PCA on its own, and also PCA followed by K-means.

However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.
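For reference, here is a minimal sketch of what such a job might look like in PySpark. It rests on several assumptions that are not part of the question: a recent Spark version with the DataFrame-based `pyspark.ml` API, input files that are CSV with a header row and purely numeric feature columns without missing values, and illustrative names (`pca_kmeans`, `feature_columns`, the column names `raw`, `scaled`, `pca`, `cluster`) that are made up for this example. The key point is that a glob path lets Spark read all the files as a single distributed DataFrame, so PCA and K-means then run across the cluster without any per-file handling.

```python
def feature_columns(all_columns, exclude=()):
    """Return the columns to feed into PCA, skipping e.g. ID or label columns."""
    return [c for c in all_columns if c not in exclude]


def pca_kmeans(input_glob, n_components=10, n_clusters=8, exclude=()):
    """Fit scaler -> PCA -> K-means on every file matching input_glob.

    Assumes CSV files with a header row and numeric, non-null feature columns.
    """
    # Imported here so this module can be loaded without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import PCA, StandardScaler, VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pca-kmeans").getOrCreate()

    # A glob (or a directory) is read as one DataFrame partitioned across
    # the executors, so the many input files need no special treatment.
    df = spark.read.csv(input_glob, header=True, inferSchema=True)

    stages = [
        # Pack the numeric columns into a single vector column.
        VectorAssembler(inputCols=feature_columns(df.columns, exclude),
                        outputCol="raw"),
        # Center and scale so no single column dominates the components.
        StandardScaler(inputCol="raw", outputCol="scaled",
                       withMean=True, withStd=True),
        # Project onto the top n_components principal components.
        PCA(k=n_components, inputCol="scaled", outputCol="pca"),
        # Cluster in the reduced space.
        KMeans(k=n_clusters, featuresCol="pca",
               predictionCol="cluster", seed=1),
    ]

    model = Pipeline(stages=stages).fit(df)
    return model, model.transform(df)
```

A call could look like `model, clustered = pca_kmeans("hdfs:///data/part-*.csv", exclude=("id",))`, where the path and the `id` column are hypothetical. `n_components` must not exceed the number of feature columns, and both it and `n_clusters` are placeholders that would need tuning for the actual data.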

There are 0 answers