how to quantile-discretize on spark?

503 views Asked by At

i want to quantile-discretize RDD[Float] to 10 pieces without Spark.ML, so i need to calculate 10th-Percentile, 20th-Percentile...80th-Percentile,90th-Percentile

data-set is very big, can't collect to local!

have any efficient algorithm to solve this problem?

1

There are 1 answers

0
dumitru On BEST ANSWER

There is already provided this capability is your are using Spark version > 2.0. You have to convert your RDD[Float] to a dataframe. Use approxQuantile(String col, double[] probabilities, double relativeError) from DataFrameStatFunctions. From the documentation is says:

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first present in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna