GIST index or other alternative for indexing JSONB columns in Postgres

203 views Asked by At

Context:

  • Postgres 15
  • table hundreds of millions or maybe billions of rows, the tables contain an id colummn and a JSONB column
  • no other constraints (FKs)
  • selecting from the table is usually/often happens by searching for content of JSON fields
  • one JSONB cell is around 5 kB in our case
  • not using Postgres is out of the question

The problem: Sometime the table is under heavy upsert pressure (INSERT ON CONFLICT), in these cases we need high throughput, so we'd like to use concurrent upserts. The issue is that in order to facilitate searching in the table, we use a GIN index on the JSONB column. Maintaining this index seems to put limits on how much we can scale upsert performance.

  • By default, Postgres is using fastupdate to allow fast upserts and defers GIN maintenance until a bunch of non-indexed tuples accumulate. The default setting for this is 4 MB. We see that the p50 and p95 upsert duration is <15 msec, but under high load every 20-40 seconds we see a 10 second pause during which nothing can insert into the table. We think this is due GIN index maintenance, i.e. processing pending non-indexed tuples. Throughput averages out to ~ 80-100 upserts/second independent of thread count.
  • If we turn fastupdate=off then things are better; upserts get somewhat slower but under high load there's still a limit on concurrency, so we can't scale up upserts to above 2-3, which limits the throughput to ~100 upserts or so.

It is OK for us if reads from this table become a bit slower, so I was thinking that we could use a GIST index instead, but I don't know how to do it properly.

> create index concurrently gist_idx on thetablename using gist(thejsoncolumn);
ERROR:  data type jsonb has no default operator class for access method "gist"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

AFAIK GIST index maintenance is also more concurrency-friendly as upserts into a table with GIN need to lock multiple rows in the index.

Questions:

  • does my GIST idea look sensible?
  • how should I create a GIST index on the JSONB column (both keys and values are searched for, so something similar to json_ops would be OK)
  • does someone have experience with how fastupdate=off behaves under concurrent load?
  • maybe we should set gin_pending_list_limit to something very high (gigabytes) then let autovacuum process it in the background and until then we'd get lower SELECT performance?

Thanks.

1

There are 1 answers

2
jjanes On

GiST is specialized for data which can overlap in a contiguous-like fashion, like geometries. Not data which can overlap in a scattered fashion, like "Does this contain the sequence of letters 'dasdf' somewhere within it". So a GiST index on JSONB would need to use some highly folded signature-based approach, and those tend to suck for data with large element counts (like 5kB of JSONB). Which is probably why no one has bothered to implement GiST for that purpose. GIN is really the correct index method for this purpose.

We see that the p50 and p95 upsert duration is <15 msec, but under high load every 20-40 seconds we see a 10 second pause during which nothing can insert into the table

This should not happen. One inserting process will get assigned to clean the pending list, and that one process will stall while it does so. But concurrent inserting processes should not block while it happens, they should be free to proceed. I guess what could be happening is if the concurrent processes are largely working on the same unique key value used in the ON CONFLICT, then maybe one process stalls while cleaning the list, and the other ones block waiting for the conflict involving that first stalled one to resolve. But this should be rare unless you have some perverse pattern to your data insertions. On the other hand, maybe other processes don't really stall, they just slow down due to IO congestion. Can you use pg_stat_activity.wait_event (and maybe pg_locks) to identify what is actually going on?

If we turn fastupdate=off then things are better; upserts get somewhat slower but under high load there's still a limit on concurrency, so we can't scale up upserts to above 2-3, which limits the throughput to ~100 upserts or so.

Again, look in pg_stat_activity to see what is going on. It could just be IO saturation, or maybe there is something else going on which extends beyond merely being a gin index with fastupdate off. But we would need to know what that something else is.

maybe we should set gin_pending_list_limit to something very high (gigabytes) then let autovacuum process it in the background and until then we'd get lower SELECT performance?

You can use the gin_clean_pending_list function to clean the pending list without having to do all the rest of the stuff that vacuum does. You would have to create your own background process which invokes this function periodically, as there is no built in scheduler for it. This can give the best of both worlds, you can do it often enough to keep the list small, and it happens in the background, (hopefully) without blocking any foreground tasks.