My data looks like this:
idx,year,month,day,metadata,not_impt,metricx
123,2022,12,02,"blah blah","lah lah",-123.94
123,2022,11,05,"blah blah asd","lah lah",62.4
123,2022,12,03,"blah blah asd","lah lah",39.512
123,2022,12,09,"blah blah","lah lah",12.412
123,2022,11,19,"blah blah","lah lah",24.43
123,2022,11,26,"blah blahac ","lah lah",94.94
987,2022,12,12,"blah blah","lah lah",-23.94
987,2022,11,15,"blah blahvs","lah lah",42.4
987,2022,11,03,"blah blah","lah lah",32.512
987,2022,12,04,"blah blah kams","lahada lah",19.412
987,2022,12,19,"blah blah","lah lah",21.43
987,2022,11,26,"blah blah","lah lah",74.94
And they are already read into an Athena view tablex and there's a real number column, metricx that I'll like to compute some statistics in a new view, the values ranges from [-500, +500].
The goal is to create a new view,
- group by
idxandyearandmonth avg_metrix: compute the average value ofmetricxin each group-bynorm_metric: compute the min-max normalized value of the averagedmetricxerrorbar_top_metricanderrorbar_bottom_metric: compute the error bar top and bottom values for the 95% confidence interval,
I've tried the follow and it works but I'm not re-using the values I've aggregated when computing the error bar values:
CREATE VIEW AS
SELECT idx,
concat(cast(year as varchar), '-', cast(month as varchar)) as date,
count(*) as num_rows,
AVG(metricx) as avg_metric,
((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) as norm_metric,
(
((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) -
(STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
) as errorbar_bottom_metric,
(
((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) +
(STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
) as errorbar_top_metric,
FROM tablex
GROUP BY idx, year, month
Although the SQL works, there's a few repetitions when I'm not reusing the values computed in the previous columns.
Given that the data size is not that big < 100,000 rows, is there an cleaner way to write the SQL without copy and pasting the computation of the previous columns?
Maybe, nested SELECT query?
You can talk about readability if you have an idea about the reader.
I think the nested query will make it more understandable for most readers. Reviewing the calculations step by step shows the purpose and result of the calculations.
Execution plan for MySql 8.0 with create index ix_id_dt on tablex (idx,year,month);
There is nothing to say here without knowing the structure and size of the data.
Explain for query above
For your current query