I have a PySpark dataframe ticket_zip. It has an id column called mch_id and a column called qty that holds the quantity associated with that mch_id on each record. I would like to get the first n mch_id, ordered by qty descending, that together account for 50% of the total qty across all records in ticket_zip. I've been trying to use the Spark SQL code below with a window function to achieve this, but it seems to be including too many mch_id: when I filter ticket_zip by the mch_id returned by the code and sum the qty, the result is greater than 50% of the total. When I run .show() to inspect the first 20 records returned by the query, it appears to be calculating the running total correctly. Does anyone see what I'm doing wrong, and can you suggest how to fix it? My code and example data are below.
example data:
+------+---+
|mch_id|qty|
+------+---+
|     2|  2|
|     2|  1|
|     5|  2|
|     1|  1|
|     4|  2|
|     7|  1|
+------+---+
code:
ticket_zip.createOrReplaceTempView("ticket_zip")

top_1K_Cquery = """
with ticket_ste_90 as (
    select * from ticket_zip
),
tot_qty as (
    -- grand total of qty across all records
    select sum(qty) as tot_qty from ticket_ste_90
),
top_1K_mch_id as (
    select * from (
        select
            sum(qty) over (order by qty desc) as running_tot,
            (sum(qty) over (order by qty desc)) / (select * from tot_qty) as perc,
            qty as qty_sold,
            mch_id
        from ticket_ste_90
    )
    where perc <= 0.5
    order by qty_sold desc
)
select * from top_1K_mch_id
"""

top_1K_Cdf = spark.sql(top_1K_Cquery)
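
In case it's useful, this is roughly how I'm checking the result (a sketch of my verification step, not the exact code I run; it assumes the dataframes above and pyspark.sql.functions imported as F):

from pyspark.sql import functions as F

# distinct mch_id values returned by the query above
top_ids = [r["mch_id"] for r in top_1K_Cdf.select("mch_id").distinct().collect()]

# sum of qty for those ids vs. the grand total over all of ticket_zip
picked_qty = ticket_zip.filter(F.col("mch_id").isin(top_ids)).agg(F.sum("qty")).collect()[0][0]
total_qty = ticket_zip.agg(F.sum("qty")).collect()[0][0]

print(picked_qty / total_qty)  # this comes out above 0.5, which is the problem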