How to translate a pandas calculation to SQL (group by position average salary without outliers)

69 views Asked by At

I have PosgreSQL table salary like:

| the_date    | ... | position   | ... | ... | salary | ... |  
| --------    | --- | ---------- | --- | --- | ----   | --- |
| 2024-02-19  | ... | position_A | ... | ... | 2200   | ... |
| 2024-02-19  | ... | position_B | ... | ... | 1890   | ... |
| 2024-02-19  | ... | position_B | ... | ... | 1750   | ... |
| 2024-02-20  | ... | position_C | ... | ... | 3000   | ... |
| 2024-02-20  | ... | position_C | ... | ... | 4000   | ... |
| 2024-02-20  | ... | position_A | ... | ... | 1500   | ... |
| 2024-02-21  | ... | position_A | ... | ... | 1000   | ... |
| 2024-02-21  | ... | position_B | ... | ... | 9000   | ... |
| 2024-02-22  | ... | position_C | ... | ... | 850    | ... |
| 2024-02-22  | ... | position_A | ... | ... | 9000   | ... |

etc...

I currently use approach via pandas as:


def is_outlier(s):
    # lower_limit = s.mean() - (s.std() * 3)
    # upper_limit = s.mean() + (s.std() * 3)
    lower_limit = s.median() - (s.quantile(0.25))
    upper_limit = s.median() + (s.quantile(0.75))
    return ~s.between(lower_limit, upper_limit)

...
...

sql_query = f"SELECT position, salary FROM salary WHERE the_date >= '{get_from_date()}';" # like '2023-08-28' -> today - 180 days
start = time()
df = pd.read_sql(sql=sql_text(sql_query), con=engine_salary.connect())
end = time()
print(f"pull data for salary_by_position {end - start} sec.")

df_filtered = df[~df.groupby('position', group_keys=False)['salary'].apply(is_outlier)]

df_id_grouped = df_filtered.groupby(by="position")["salary"].median().apply(lambda x: round(x, 0)).sort_values(
            ascending=True
        )

# or
df_id_grouped = df_filtered.groupby(by="position")["salary"].mean().apply(lambda x: round(x, 0)).sort_values(
            ascending=True
        )

... it works fine, but in this approach I need to load about 30k rows, download them from DB to the application and then I can calculate what I need... But I think a better approach might be to calculate the average/median salary for each position with no outliers on the DB side (using SQL) and only return 600 rows as the salary for the position

But I have no idea how I can solve this.

This query returns me the average salary for the position, but with outlier

select position, round(avg(cast(salary as numeric)), 0) as avg_salary
from salary
where the_date >= '2023-08-26'
group by position
order by avg_salary DESC

How can I extend the sql query to get the same result that the pandas calculation gives me?

SELECT s.position, round(avg(cast(s.salary as numeric)), 2) AS avg_salary
FROM (
    SELECT s.* (
        CASE WHEN salary >= median(salary) OVER (PARTITION BY position) - 0.25 * STDDEV(salary) OVER (PARTITION BY position) AND
                  salary <= median(salary) OVER (PARTITION BY position) + 0.25 * STDDEV(salary) OVER (PARTITION BY position)
            THEN 'N' ELSE 'Y'
        END
    ) as outlier
) s
WHERE the_date >= '2023-08-26' and outlier = 'N'
GROUP BY position;

but this query doesn't work

also:

SELECT s.position, s.salary, s.Sal
FROM ( 
    SELECT position, salary, NTILE(100) OVER (ORDER BY salary) AS Sal
    FROM salary
    WHERE the_date > '2023-08-26'
) s
WHERE s.Sal BETWEEN 25 AND 75
GROUP BY s.position
ORDER BY s.salary DESC

but this query doesn't work

1

There are 1 answers

0
Metin AKTAŞ On

I think percent_rank will work for you:

    SELECT s.position, s.salary, s.percent
    FROM ( 
        SELECT position, salary, PERCENT_RANK() OVER(PARTITION BY position ORDER BY salary) AS percent
        FROM salary
        WHERE the_date > '2023-08-26'
    ) s
    WHERE s.percent BETWEEN 0.25 AND 0.75
    ORDER BY s.salary DESC