I wrote a query that renders a bar graph in the terminal using the Postgres CLI (psql). The query is slow and inefficient, and I would like to change that.
At the base, we have a pretty simple query: we want each row to be one division of the total number of rows in our table. Let's say that our hardcoded number of divisions is N_ROWS and our table is my_table.
Also, let's say N_ROWS equals 8.
select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num);
In my case, this returns the Y-axis of my chart as:
level
-------
71760
62790
53820
44850
35880
26910
17940
8970
0
You can see the issues with that query already.
Can I programmatically generate the rows using N_ROWS, rather than hardcoding each value in VALUES? I also don't like how I perform a new count over my whole table for each row, obviously.
We now need our X-axis, and this is what I came up with:
select
    r.level,
    case
        when (
            select count(id) from my_table
            where created_at_utc <= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01"
from (
    select (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
    from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;
Which returns our first bucket:
level | 2019-01-01
-------+------------
71760 |
62790 |
53820 |
44850 |
35880 |
26910 | t
17940 | t
8970 | t
0 | t
I'd rather not hardcode a case statement for each bucket, but, of course, that's what I did: one case expression per month, along these lines (only the first two buckets shown):
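select
    r.level,
    case
        when (
            select count(id) from my_table
            where created_at_utc <= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01",
    case
        when (
            select count(id) from my_table
            where created_at_utc <= '2019-02-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-02-01"
    -- ...repeated for each remaining month through "2019-12-01"...
from (
    select (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
    from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;

The results are what I was looking for: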
level | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01
-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
71760 | | | | | | | | | | | | t
62790 | | | | | t | t | t | t | t | t | t | t
53820 | | | | t | t | t | t | t | t | t | t | t
44850 | | | t | t | t | t | t | t | t | t | t | t
35880 | | t | t | t | t | t | t | t | t | t | t | t
26910 | t | t | t | t | t | t | t | t | t | t | t | t
17940 | t | t | t | t | t | t | t | t | t | t | t | t
8970 | t | t | t | t | t | t | t | t | t | t | t | t
0 | t | t | t | t | t | t | t | t | t | t | t | t
There are certainly a few improvements we can make.
First, let's make a test table with some data:
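For instance, something along these lines (a sketch; the schema mirrors the columns the question uses, and the volume is an assumption, one row per minute across 2019, roughly half a million rows):

-- schema assumed from the columns referenced in the question
create table my_table (
    id serial primary key,
    created_at_utc timestamp without time zone
);

-- one row per minute of 2019
insert into my_table (created_at_utc)
select generate_series(
    '2019-01-01 00:00:00'::timestamp,
    '2019-12-31 00:00:00'::timestamp,
    '1 minute'::interval
);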
We can substitute your first query, the one that finds the levels, with this almost-equivalent, much faster query:
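One version of that idea (a sketch, with N_ROWS still hardcoded as 8 in the step expression):

select generate_series(count(id), 1, count(id) / -8) as level
from my_table;

count(id) is computed exactly once, and generate_series() steps down from the total in count / 8 decrements, so the table is scanned a single time instead of once per VALUES row.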
Note that the lowest level is 1 instead of 0, but the rest should be the same.
There are a few other tricks that we can employ:
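A sketch of what those can look like, combining generate_series() for both the levels and the day buckets, a windowed running total over a single table scan, and psql's \crosstabview to pivot the buckets into columns (the exact combination is an assumption, and the CTE and column names here are illustrative):

with levels as (
    -- count the table once, then fan the total out into the Y-axis values
    select generate_series(count(*), 1, count(*) / -8) as level
    from my_table
), days as (
    -- one bucket per day of 2019; the bounds are hardcoded for now
    select generate_series(
        '2019-01-01 00:00:00'::timestamp,
        '2019-12-31 00:00:00'::timestamp,
        '1 day'::interval
    ) as bucket
), counted as (
    -- a single scan of the table: how many rows fall on each day
    select date_trunc('day', created_at_utc) as day, count(*) as n
    from my_table
    group by 1
), cumulative as (
    -- running total per bucket instead of one count(id) per case column
    select d.bucket::date as bucket,
           sum(coalesce(c.n, 0)) over (order by d.bucket) as cnt
    from days d
    left join counted c on c.day = d.bucket
)
select l.level,
       c.bucket,
       case when c.cnt >= l.level then true end as filled
from levels l
cross join cumulative c
order by l.level desc, c.bucket
\crosstabview level bucket filled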
That query ran in 40ms on my laptop.
The dates could be derived from the min and max of the dates in the test table, and the interval could be changed from 1 day, depending on how many columns are desired between the max and min.
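That substitution might look like this (a sketch; it would replace the hardcoded days CTE above):

-- derive the bucket boundaries from the data instead of hardcoding them
select generate_series(
    date_trunc('day', min(created_at_utc)),
    date_trunc('day', max(created_at_utc)),
    '1 day'::interval
) as bucket
from my_table;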