I want to quantile-discretize an RDD[Float] into 10 buckets without Spark ML, so I need to calculate the 10th percentile, 20th percentile, ..., 80th percentile, 90th percentile.
The data set is very big, so I can't collect it to the local machine!
Is there an efficient algorithm to solve this problem?
This capability is already provided if you are using Spark version 2.0 or later. You have to convert your RDD[Float] to a DataFrame, then use
approxQuantile(String col, double[] probabilities, double relativeError)
from DataFrameStatFunctions. The documentation says:
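A minimal sketch of the approach, assuming a local SparkSession and a toy RDD in place of your real data (the column name `"value"` and the relative error of 0.001 are arbitrary choices):

```scala
import org.apache.spark.sql.SparkSession

object DecileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("decile-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for your RDD[Float]; replace with your actual data
    val rdd = spark.sparkContext.parallelize((1 to 1000).map(_.toFloat))

    // Convert the RDD to a single-column DataFrame so the stat functions apply
    val df = rdd.toDF("value")

    // Probabilities 0.1, 0.2, ..., 0.9 give the 10th..90th percentiles;
    // relativeError trades accuracy for speed (0.0 forces an exact computation)
    val probabilities = (1 to 9).map(_ / 10.0).toArray
    val deciles = df.stat.approxQuantile("value", probabilities, 0.001)

    println(deciles.mkString(", "))
    spark.stop()
  }
}
```

Because `approxQuantile` uses an approximate streaming algorithm, it never needs to collect the full data set to the driver, which fits the "too big to collect" constraint.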