I have PosgreSQL table salary
like:
| the_date | ... | position | ... | ... | salary | ... |
| -------- | --- | ---------- | --- | --- | ---- | --- |
| 2024-02-19 | ... | position_A | ... | ... | 2200 | ... |
| 2024-02-19 | ... | position_B | ... | ... | 1890 | ... |
| 2024-02-19 | ... | position_B | ... | ... | 1750 | ... |
| 2024-02-20 | ... | position_C | ... | ... | 3000 | ... |
| 2024-02-20 | ... | position_C | ... | ... | 4000 | ... |
| 2024-02-20 | ... | position_A | ... | ... | 1500 | ... |
| 2024-02-21 | ... | position_A | ... | ... | 1000 | ... |
| 2024-02-21 | ... | position_B | ... | ... | 9000 | ... |
| 2024-02-22 | ... | position_C | ... | ... | 850 | ... |
| 2024-02-22 | ... | position_A | ... | ... | 9000 | ... |
etc...
I currently use approach via pandas as:
def is_outlier(s):
# lower_limit = s.mean() - (s.std() * 3)
# upper_limit = s.mean() + (s.std() * 3)
lower_limit = s.median() - (s.quantile(0.25))
upper_limit = s.median() + (s.quantile(0.75))
return ~s.between(lower_limit, upper_limit)
...
...
sql_query = f"SELECT position, salary FROM salary WHERE the_date >= '{get_from_date()}';" # like '2023-08-28' -> today - 180 days
start = time()
df = pd.read_sql(sql=sql_text(sql_query), con=engine_salary.connect())
end = time()
print(f"pull data for salary_by_position {end - start} sec.")
df_filtered = df[~df.groupby('position', group_keys=False)['salary'].apply(is_outlier)]
df_id_grouped = df_filtered.groupby(by="position")["salary"].median().apply(lambda x: round(x, 0)).sort_values(
ascending=True
)
# or
df_id_grouped = df_filtered.groupby(by="position")["salary"].mean().apply(lambda x: round(x, 0)).sort_values(
ascending=True
)
... it works fine, but in this approach I need to load about 30k rows, download them from DB to the application and then I can calculate what I need... But I think a better approach might be to calculate the average/median salary for each position with no outliers on the DB side (using SQL) and only return 600 rows as the salary for the position
But I have no idea how I can solve this.
This query returns me the average salary for the position, but with outlier
select position, round(avg(cast(salary as numeric)), 0) as avg_salary
from salary
where the_date >= '2023-08-26'
group by position
order by avg_salary DESC
How can I extend the sql query to get the same result that the pandas calculation gives me?
SELECT s.position, round(avg(cast(s.salary as numeric)), 2) AS avg_salary
FROM (
SELECT s.* (
CASE WHEN salary >= median(salary) OVER (PARTITION BY position) - 0.25 * STDDEV(salary) OVER (PARTITION BY position) AND
salary <= median(salary) OVER (PARTITION BY position) + 0.25 * STDDEV(salary) OVER (PARTITION BY position)
THEN 'N' ELSE 'Y'
END
) as outlier
) s
WHERE the_date >= '2023-08-26' and outlier = 'N'
GROUP BY position;
but this query doesn't work
also:
SELECT s.position, s.salary, s.Sal
FROM (
SELECT position, salary, NTILE(100) OVER (ORDER BY salary) AS Sal
FROM salary
WHERE the_date > '2023-08-26'
) s
WHERE s.Sal BETWEEN 25 AND 75
GROUP BY s.position
ORDER BY s.salary DESC
but this query doesn't work
I think
percent_rank
will work for you: