I'm using spark-sql-2.4.1v, and I'm trying to do find quantiles, i.e. percentile 0, percentile 25, etc, on each column of my given data.
My data:
+----+---------+-------------+----------+-----------+--------+
| id| date| revenue|con_dist_1| con_dist_2| state |
+----+---------+-------------+----------+-----------+--------+
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX |
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| CA |
| 10|1/15/2018| 0.010680705| 6|0.019875458| CA |
| 10|1/15/2018| 0.006628853| 4|0.816039063| CA |
+----+---------+-------------+----------+-----------+--------+
I would get the states to calculate i.e
val states = Seq("CA","AZ");
val cols = Seq("con_dist_1" ,"con_dist_2")
for each given state I need to fetch data from source table and calculate percentiles only for the given columns.
I'm trying as below
for( state <- states){
for( col <- cols){
// pecentile calculation
}
}
this is too slow, when doing group by "state" wont get another columns like revenue, date and id.. how to get those?
How to find the quantiles on the columns "con_dist_1" & "con_dist_2" for each state? So what is the best way which scales well on cluster?
What is the best way to handle this use-case?
Expected result
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|state|col1_quantile_1|col1_quantile_2|col1_quantile_3|col2_quantile_1|col2_quantile_2|col2_quantile_3|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
| AZ| 4| 4| 4| 0.816039063| 0.816039063| 0.816039063|
| TX| 4| 6| 6| 0.019875458| 0.019875458| 0.082049528|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
You may have to do something similar to the below piece of code
may be the last set of withColumn should be inside a loop based on number of records for each state.
Hope this helps!