Find the ids responsible for the top 50 percent of total qty using a running total


I have a PySpark DataFrame, ticket_zip. It has an id column called mch_id, and a qty column that gives the quantity associated with that mch_id on each record. I would like to get the first n mch_id, ordered by qty descending, whose summed qty accounts for 50% of the total qty across all records in ticket_zip. I've been trying to use the Spark SQL code below with a window function to achieve this, but it seems to include too many mch_id: when I filter ticket_zip by the mch_id returned by the code and sum their qty, the result is greater than 50% of the total (a rough sketch of that check is included after the code). When I run .show() to inspect the first 20 records returned by the query, it appears to be calculating the running total correctly. Does anyone see what I'm doing wrong, and can you suggest how to fix it? My code and example data are below.

example data:

+------+---+
|mch_id|qty|
+------+---+
|2     |2  |
|2     |1  |
|5     |2  |
|1     |1  |
|4     |2  |
|7     |1  |

code:

ticket_zip.createOrReplaceTempView("ticket_zip")

top_1K_Cquery="""
with ticket_ste_90 as (
select * from ticket_zip 
),
tot_qty as (
select sum(qty) as tot_qty from ticket_ste_90
),
top_1K_mch_id as (
select * from (
select
sum(qty) over(order by qty desc) as running_tot,
(sum(qty) over(order by qty desc))/(select * from tot_qty) as perc,
qty as qty_sold,
mch_id
from ticket_ste_90 
)
where perc <= 0.5
order by qty_sold desc

)

select * from top_1K_mch_id
"""

top_1K_Cdf=spk.sql(top_1K_Cquery)
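
For reference, this is roughly the check I'm running that shows the returned mch_id account for more than 50% of the total (a sketch using standard PySpark functions, not part of the query itself):

import pyspark.sql.functions as F

# Collect the mch_id returned by the query above
selected_ids = [row["mch_id"] for row in top_1K_Cdf.select("mch_id").distinct().collect()]

# Sum the qty for those mch_id and compare against the grand total
selected_qty = (
    ticket_zip
    .filter(F.col("mch_id").isin(selected_ids))
    .agg(F.sum("qty"))
    .first()[0]
)
total_qty = ticket_zip.agg(F.sum("qty")).first()[0]

# This ratio comes out greater than 0.5, even though the query keeps only perc <= 0.5
print(selected_qty / total_qty)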
