How can I traverse a tree bottom-up to calculate a (weighted) average of node values in PostgreSQL?

2.4k views Asked by At

The typical example for e.g. summing a whole tree in PostgreSQL is using WITH RECURSIVE (Common Table Expressions). However, these examples typically go from top to bottom, flatten the tree and perform an aggregate function on the whole result set. I have not found a suitable example (on StackOverflow, Google, etc.) for the problem I am trying to solve:

Consider an unbalanced tree where each node can have an associated value. Most of the values are attached to leaf nodes, but the others may have values as well. If a node (leaf or not) has an explicitly attached value, this value can be directly used without further calculation (subtree can be ignored, then). If the node has no value, the value should be computed as the average of its direct children.

However, as none of the nodes are guaranteed to have a value attached, I need to go bottom up in order to obtain a total average. In a nutshell, starting from the leafs, I need to apply AVG() to each set of siblings and use this (intermediate) result as value for the parent node (if it has none). This parent's (new) value (explicitly attached, or the average of its children) is in turn used in the calculation of average values at the next level (the average value of the parent and its siblings).

Example situation:

A
+- B (6)
+- C
   +- D
      +- E (10)
      +- F (2)
+- H (18)
   +- I (102)
   +- J (301)

I need to compute the average value for A, which should be 10 (because (6+6+18)/3 = 10 and I,J are ignored).

1

There are 1 answers

1
klin On BEST ANSWER

Your data can be stored as:

create table tree(id int primary key, parent int, caption text, node_value int);
insert into tree values
(1, 0, 'A', null),
(2, 1, 'B', 6),
(3, 1, 'C', null),
(4, 3, 'D', null),
(5, 4, 'E', 10),
(6, 4, 'F', 2),
(7, 1, 'H', 18),
(8, 7, 'I', 102),
(9, 7, 'J', 301);

The simplest way to do bottom-up aggregation is a recursive function.

create or replace function get_node_value(node_id int)
returns int language plpgsql as $$
declare
    val int;
begin
    select node_value
    from tree 
    where id = node_id
    into val;
    if val isnull then
        select avg(get_node_value(id))
        from tree
        where parent = node_id
        into val;
    end if;
    return val;
end;
$$;

select get_node_value(1);

 get_node_value 
----------------
             10
(1 row)

Test it here.

It is possible to achieve the same in an sql function. The function code is not so obvious but it may be a bit faster than plpgsql.

create or replace function get_node_value_sql(node_id int)
returns int language sql as $$
    select coalesce(
        node_value,
        (
            select avg(get_node_value_sql(id))::int
            from tree
            where parent = node_id
        )
    )
    from tree 
    where id = node_id;
$$;

Viewing a tree from the bottom up using cte is not especially complicated. In this particular case the difficulty lies in the fact that average should be computed for each level separately.

with recursive bottom_up(id, parent, caption, node_value, level, calculated) as (
    select 
        *, 
        0, 
        node_value calculated
    from tree t
    where not exists (
        select id
        from tree
        where parent = t.id)
union all
    select 
        t.*, 
        b.level+ 1,
        case when t.node_value is null then b.calculated else t.node_value end
    from tree t
    join bottom_up b on t.id = b.parent
)

select id, parent, caption, avg(calculated)::int calculated
from (
    select id, parent, caption, level, avg(calculated)::int calculated
    from bottom_up
    group by 1, 2, 3, 4
    ) s
group by 1, 2, 3
order by 1;

Test it here.