Consider a situation where there is a really large table (up to several TB) and, for some reason, an application wants to scan this table in full. There are multiple CPU cores on both the application side and the PostgreSQL server side, which suggests splitting the table into multiple parts and reading them concurrently in parallel threads. What would be the best way to achieve that?
Does it make sense to emulate table "pagination" with OFFSET / LIMIT like this?
SELECT * FROM table LIMIT 1000000 OFFSET 0
SELECT * FROM table LIMIT 1000000 OFFSET 1000000
SELECT * FROM table LIMIT 1000000 OFFSET 2000000
(...)
Each SELECT would then be executed in a distinct application thread.
Or is it worth trying PostgreSQL's native table partitioning, so that each partition (a distinct table, in fact) could be read independently?
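For reference, this is roughly the layout I have in mind; the table name, columns, and range boundaries below are made up for illustration:

CREATE TABLE big_table (
    id      bigint NOT NULL,
    payload text
) PARTITION BY RANGE (id);

CREATE TABLE big_table_p0 PARTITION OF big_table FOR VALUES FROM (0) TO (10000000);
CREATE TABLE big_table_p1 PARTITION OF big_table FOR VALUES FROM (10000000) TO (20000000);
(...)

-- each application thread would then read one partition directly
SELECT * FROM big_table_p0;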
Another important point is consistency: the table may of course change while it is being scanned. Does PostgreSQL provide any kind of transaction mechanism that would let independent SELECTs work against a consistent view of the table, without seeing data brought in by later requests?
PostgreSQL has support for parallel queries.
But parallelization isn't a magic bullet.
I'm not going to go into detail here; the documentation covers how to configure Postgres to use parallel workers and perform parallel sequential scans.
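As a rough sketch of what that configuration looks like (the table name big_table and the specific values are placeholders, and the server-wide limits max_worker_processes and max_parallel_workers must also allow enough workers):

SET max_parallel_workers_per_gather = 8;   -- workers the planner may put under one Gather node
SET min_parallel_table_scan_size = '8MB';  -- size threshold for even considering a parallel scan

EXPLAIN SELECT count(*) FROM big_table;
-- you want to see a plan shaped roughly like:
--   Finalize Aggregate
--     ->  Gather
--           Workers Planned: 8
--           ->  Partial Aggregate
--                 ->  Parallel Seq Scan on big_table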
At a certain point it doesn't matter how many CPUs you have: you're going to run into the limits of disk I/O. Even the best SSDs will only read around 10 GB/s, so several terabytes will take at least a few minutes of pure read time (for example, 3 TB at 10 GB/s is roughly 300 seconds before any CPU work). It would be better to question whether you can avoid having to seq scan the table at all.
Possibly, but table partitioning is also not a magic performance bullet. Without more details I can't say.
A 3rd option is to move this massive blob of data into a data warehouse optimized for these sorts of big queries.
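As for the consistency part of the question: yes, PostgreSQL lets several independent connections share one snapshot so they all see the same version of the table (this is the same mechanism parallel pg_dump uses). A minimal sketch, assuming one coordinating connection plus one connection per worker; big_table and the id range are placeholders:

-- coordinator connection: open a transaction, export its snapshot,
-- and keep this transaction open until all workers have finished
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();        -- returns an id such as '00000003-0000001B-1'

-- each worker connection: adopt that snapshot before its first query
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';           -- the id returned above
SELECT * FROM big_table WHERE id >= 0 AND id < 1000000;   -- this worker's slice
COMMIT;

Splitting the work by ranges of an indexed key, as in the WHERE clause above, also scales much better than large OFFSETs, because OFFSET still has to read and discard all the preceding rows.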