I asked a fairly similar yet different question and got a good response here:
Groupby and percentage distributions pyspark equivalent of given pandas code
However, I am not sure how to adapt that solution to my current need.
What I would like to do is create a separate group for each percentile, and for each percentile group the aggregates should be the mean and median of a certain fixed variable (the same variable for every percentile). Here is an illustration of what I have in mind (just to make sure it's clear: the variables on the left currently exist at the variable level, like "age" and "income"; I don't already have the percentile groups created in advance, that's part of what I need to create):
                         mean(credit score)   median(credit score)
age_10th_percentile                     700                    550
age_25th_percentile                     710                    560
age_50th_percentile                     750                    580
income_10th_percentile                  710                    590
income_25th_percentile                  730                    610
income_50th_percentile                  740                    640
The format of the output dataframe(s) won't be the same as what you have written in your question, because pyspark doesn't really have the concept of an index. However, you can first calculate the percentiles of each column, use these to bin your data accordingly, and then calculate the mean and median of any other columns using those bins.

We start out with a sample pyspark dataframe that looks like the following:
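The original sample data isn't reproduced here, so this is only a minimal sketch of such a starting dataframe, assuming two numeric columns named col1 and col2 (the same names used further down); the values are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data; any two numeric columns will do.
df = spark.createDataFrame(
    [(25, 40000), (32, 55000), (41, 62000), (29, 48000), (55, 90000), (38, 75000)],
    ["col1", "col2"],
)
df.show()
```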
Then we calculate the percentiles of each column, and use these values to assign buckets with a udf:
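A sketch of this step, assuming the 10th/25th/50th percentiles from the question, approxQuantile for the cutoffs, and a "_bucket" suffix for the new columns (the exact labels and helper names are my own choices, not from the original answer):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Approximate 10th/25th/50th percentile cutoffs per column (relativeError=0.0 gives exact values).
percentiles = {
    c: df.approxQuantile(c, [0.1, 0.25, 0.5], 0.0) for c in ["col1", "col2"]
}

def make_bucket_udf(cutoffs):
    # Label each value with the first percentile cutoff it falls under;
    # values above the 50th percentile get their own bucket.
    p10, p25, p50 = cutoffs
    def assign(value):
        if value <= p10:
            return "10th"
        elif value <= p25:
            return "25th"
        elif value <= p50:
            return "50th"
        return "above_50th"
    return F.udf(assign, StringType())

# Add a percentile-bucket column for each original column.
for c in ["col1", "col2"]:
    df = df.withColumn(f"{c}_bucket", make_bucket_udf(percentiles[c])(F.col(c)))
```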
This gives us a percentile-bucket column alongside each of the original columns.
Then we can calculate metrics separately for any column based on the buckets of another column using groupby. For example, below we calculate the mean and median of col2 based on the buckets of col1:
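A sketch of that aggregation, assuming the "_bucket" column names from above and Spark 3.1+ for F.percentile_approx (on older versions, F.expr("percentile_approx(col2, 0.5)") can be used for the median instead):

```python
# Mean and (approximate) median of col2 within each percentile bucket of col1.
result = (
    df.groupBy("col1_bucket")
      .agg(
          F.mean("col2").alias("mean_col2"),
          F.percentile_approx("col2", 0.5).alias("median_col2"),
      )
)
result.show()
```

The same call with "col2_bucket" and "col1" swapped would produce the income-style rows of the desired output, and the two results can be unioned if a single dataframe is needed.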