Up-to-date dictionary of distinct values for column

112 views Asked by At

I have a table with many columns and several million rows like

CREATE TABLE foo (
id integer,
thing1 text,
thing2 text,
...
stuff text);

How can I manage the relevance of dictionary of unique values of stuff column that originally populates like this:

INSERT INTO stuff_dict SELECT DISTINCT stuff from foo;

Should I manually synchronize (check if new stuff value already in stuff_dict before every insert/update) or use triggers for each insert/update/delete from foo table. In latter case, what is the best design for such a trigger(s)?

UPDATE: view does not suit here, because the SELECT * FROM stuff_dict should run as fast as possible (even CREATE INDEX ON foo(stuff) does not help much when foo has tens of millions of records).

1

There are 1 answers

1
klin On

A materialized view seems to be the simplest option for a large table.

In the trigger function just refresh the view. You can use concurrently option (see the pozs's comment below).

create materialized view stuff_dict as 
    select distinct stuff
    from foo;

create or replace function refresh_stuff_dict()
returns trigger language plpgsql
as $$
begin
    refresh materialized view /*concurrently*/ stuff_dict;
    return null;
end $$;

create trigger refresh_stuff_dict
after insert or update or delete or truncate
on foo for each statement 
execute procedure refresh_stuff_dict();

While the solution with a materialized view is straightforward it may be suboptimal when the table foo is modified frequently. In this case use a table for the dictionary. An index will be helpful.

create table stuff_dict as 
    select distinct stuff
    from foo;

create index on stuff_dict(stuff);

The trigger function is more complicated and should be fired for each row after insert/update/delete:

create or replace function refresh_stuff_dict()
returns trigger language plpgsql
as $$
declare
    do_insert boolean := tg_op = 'INSERT' or tg_op = 'UPDATE' and new.stuff <> old.stuff;
    do_delete boolean := tg_op = 'DELETE' or tg_op = 'UPDATE' and new.stuff <> old.stuff;
begin
    if do_insert and not exists (select 1 from stuff_dict where stuff = new.stuff) then
        insert into stuff_dict values(new.stuff);
    end if;
    if do_delete and not exists (select 1 from foo where stuff = old.stuff) then
        delete from stuff_dict
        where stuff = old.stuff;
    end if;
    return case tg_op when 'DELETE' then old else new end;
end $$;

create trigger refresh_stuff_dict
after insert or update or delete
on foo for each row
execute procedure refresh_stuff_dict();