I'm using spark-sql-2.4.1v, and I'm trying to do find quantiles, i.e. percentile 0, percentile 25, etc, on each column of my given data.
My data:
+----+---------+-------------+----------+-----------+--------+
| id| date| revenue|con_dist_1| con_dist_2| state |
+----+---------+-------------+----------+-----------+--------+
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX |
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| CA |
| 10|1/15/2018| 0.010680705| 6|0.019875458| CA |
| 10|1/15/2018| 0.006628853| 4|0.816039063| CA |
+----+---------+-------------+----------+-----------+--------+
I would get the states to calculate i.e
val states = Seq("CA","AZ");
val cols = Seq("con_dist_1" ,"con_dist_2")
for each given state I need to fetch data from source table and calculate percentiles only for the given columns.
I'm trying as below
for( state <- states){
for( col <- cols){
// pecentile calculation
}
}
this is too slow, when doing group by "state" wont get another columns like revenue, date and id.. how to get those?
How to find the quantiles on the columns "con_dist_1" & "con_dist_2" for each state? So what is the best way which scales well on cluster?
What is the best way to handle this use-case?
Expected result
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|state|col1_quantile_1|col1_quantile_2|col1_quantile_3|col2_quantile_1|col2_quantile_2|col2_quantile_3|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
| AZ| 4| 4| 4| 0.816039063| 0.816039063| 0.816039063|
| TX| 4| 6| 6| 0.019875458| 0.019875458| 0.082049528|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
UPDATE
I found the
percentile_approxfunction from the hive context, so you don't need to use thestatfunctions.Here, I tried the automated method for the given
statesandcols. The result will be;Be aware that the result is a bit different with your expected one because I set the
states = Seq("CA", "AZ")that is given by you.ORIGINAL
Use
Windowfor the states and calculate thepercent_rankfor each column.You may filter the dataframe first, only for specific states. Anyway, the result is: