Aggregating statistics into JSON in Postgresql

415 views Asked by At

So I am trying to calculate overview statistics into JSON, but am having trouble wrangling them into a query.

There are 2 tables:

appointments
- time timestamp
- patients int


assignments
- user_id int
- appointment_id int

I want to calculate the number of patients by user, by hour for the day. Ideally, it would look like this:

[ 
  {hour: "2015-07-01T08:00:00.000Z", assignments: [
    {user_id: 123, patients: 3}, 
    {user_id: 456, patients: 10}, 
    {user_id: 789, patients: 4},
  ]},
  {hour: "2015-07-01T09:00:00.000Z", assignments: [
    {user_id: 456, patients: 1},
    {user_id: 789, patients: 6}
  ]},
  {hour: "2015-07-01T10:00:00.000Z", assignments: []}
  ...
]

I got kind of close:

with assignments_totals as (
    select user_id,sum(patients),date_trunc('hour',appointments.time) as hour
    from assignments
    inner join appointments on appointments.id = assignments.appointment_id
    group by date_trunc('hour',sales.time),user_id
  ), hours as (
    select to_char(date_trunc('hour',time),'YYYY-MM-DD"T"HH24:00:00.000Z') as hour, array_to_json(array_agg(DISTINCT assignment_totals)) as patients
    from appointments 
    left join assignment_totals on date_trunc('hour',sales.time) = assignment_totals.hour
    where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z' 
    group by date_trunc('hour',time)
    order by date_trunc('hour',time) 
  )
  select array_to_json(array_agg(hours)) as hours from hours;

Which outputs:

[ 
  {hour: "2015-07-01T08:00:00.000Z", assignments: [
    {user_id: 123, patients: 3, hour: "2015-07-01T08:00:00.000Z" }, 
    {user_id: 456, patients: 10, hour: "2015-07-01T08:00:00.000Z"}, 
    {user_id: 789, patients: 4, hour: "2015-07-01T08:00:00.000Z"},
  ]},
  {hour: "2015-07-01T09:00:00.000Z", assignments: [
    {user_id: 456, patients: 1, hour: "2015-07-01T09:00:00.000Z"},
    {user_id: 789, patients: 6, hour: "2015-07-01T09:00:00.000Z"}
  ]},
  {hour: "2015-07-01T10:00:00.000Z", assignments: [null]}
  ...
]

While this works, there are 2 issues, which may or may not be independent of each other:

  1. If there are no appointments that hour, I still want the hour to be included in the array (like 10AM in the example), but to have an empty "assignments" array. Right now it puts a null in there, and I can't figure out how to get rid of it while still keeping the hours in there.
  2. I have to have the hour included in the assignments entries along with user_id and appointments because I need it to join the assignments_totals query to the hours query. But it's unnecessary because it's already in the parent.
  3. I feel like it should be able to be done in 1 cte and 1 query and now I'm using 2 cte's... but can't figure out how to condense it and make it work.

I wanted to do something like

  hours as (
    select to_char(date_trunc('hour',time),'YYYY-MM-DD"T"HH24:00:00.000Z') as hour, sum(appointments.patients) OVER(partition by assignments.user_id) as appointments
    from appointments 
    left join assignments on appointments.id = assignments.appointment_id
    where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z'  
    group by date_trunc('hour',time)
    order by date_trunc('hour',time) 
  )
  select array_to_json(array_agg(hours)) as hours from hours

but i can't get it to work without giving me a "attribute must be in the group by or aggregate function error.

Anyone know how to fix any of these issues? Thanks in advance!

2

There are 2 answers

1
Andrew On

The main issue with your last query seems to be in conflating window functions with aggregate functions. Window functions use the OVER syntax, and they do not in themselves require GROUP BY when there are other fields in the SELECT clause. Aggregate functions, on the other hand, use GROUP BY when there are other (non-aggregate-function) fields in the SELECT clause. One practical consequence of this difference is that window functions are not automatically DISTINCT.

The issue with NULL values resulting from the window function can be resolved with a simple COALESCE such that zero is used instead of null.

So, to write your query using a window function, use something like:

WITH hours AS
(
    SELECT DISTINCT to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH:00:00.000Z') AS hour,
           COALESCE(SUM(ap.patients) OVER (PARTITION BY asgn.user_id), 0) AS appointment_count
    FROM   appointments ap
    LEFT JOIN assignments asgn ON ap.id = asgn.appointment_id
    WHERE  ap.time >= '2015-07-01T07:00:00.000Z'
    AND    ap.time < '2015-07-02T07:00:00.000Z'
)
SELECT array_to_json(array_agg(hours)) AS hours
FROM   hours
ORDER BY hour

With an aggregate function:

WITH hours AS
(
    SELECT to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH:00:00.000Z') AS hour,
           SUM(COALESCE(ap.patients, 0)) AS appointment_count,
           asgn.user_id
    FROM   appointments ap
    LEFT JOIN assignments asgn ON ap.id = asgn.appointment_id
    WHERE  ap.time >= '2015-07-01T07:00:00.000Z'
    AND    ap.time < '2015-07-02T07:00:00.000Z'
    GROUP BY asgn.user_id, to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH:00:00.000Z')
)
SELECT array_to_json(array_agg(hours)) AS hours
FROM   hours
ORDER BY hour

My syntax may not be quite correct, so double-check before using this solution or one like it (and feel free to edit to correct any errors).

0
rurabe On

Most of my frustration with this came because I was not looking at the Postgres 9.4 documentation, which has new functions for dealing with json.

The solution I found builds upon the original query, but then breaks the assignments array down using json_array_elements, filters using where, then builds it back up again. It seems pointless to have essentially:

json_agg(json_array_elements(json_agg(*)))

But it makes very little performance difference and gets me where I need to go. Feel free to comment if you find a better solution! It should also be possible in <9.4 using array_agg and unnest but I was having trouble because I was trying to unnest a record type returned from my CTE, instead of an actual row type with column definitions.

  with assignment_totals as (
    select 
      date_trunc('hour',appointments.time) as hour, 
      user_id, 
      coalesce(sum(patients),0) as patients
    from appointments
    left outer join assignments on appointment.id = assignments.appointment_id
    where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z' 
    group by date_trunc('hour',appointments.time),user_id
  ), hours as (
    select 
      to_char(assignment_totals.hour,'YYYY-MM-DD"T"HH24:00:00.000Z') as hour,
      (
        select coalesce(json_agg(json_build_object('user_id',(t->'user_id'),'patients',(t->'patients')) order by (t->>'user_id')),'[]'::json) 
        from json_array_elements(json_agg(assignment_totals)) t 
        where (t->>'patients') != '0'
      ) as patients
    from assignment_totals 
    group by assignment_totals.hour
    order by assignment_totals.hour
  )
  select array_to_json(array_agg(hours)) as hours from hours

Thanks to Andrew for pointing out that I can coalesce nulls to 0. But I still want to filter out entries where patients = 0. This solves all my problems by giving me the ability to filter them out with a where, and then gives me the ability to take out the time by building a new json object with json_build_object.