Is it normal that my data loading in PostgreSQL with partitions takes so long?


I have a database that needs to hold approximately 20,000 records per day, from 2022 to 2024. I have created 6 quarterly partitions for the years 2022 and 2023. In total there are 14 million records between 2022 and 2023. Loading one day's data with a Python script takes between 1 and 4 minutes without partitions, but with partitions it takes up to 10 minutes. Is this normal? I'm afraid that adding more data will make it take much longer.

Is there any way to make this not take so long?

DDL of the main table:

CREATE TABLE public.orders (
    id varchar(20) NOT NULL,
    "orderType" varchar(20) NULL,
    description text NULL,
    "createdDate" timestamp NULL,
    samples json NULL,
    discarded bool NULL,
    "customerId" varchar(5) NULL,
    "originId" varchar(4) NULL,
    "labTestGroups" json NULL,
    prices json NULL,
    tax varchar(1) NULL,
    status varchar(20) NULL
)
PARTITION BY RANGE ("createdDate");
CREATE INDEX "orders_createdDate_idx" ON ONLY public.orders USING btree ("createdDate");

And the DDL of my partitions looks like this:

CREATE TABLE public.orders2022_q1 PARTITION OF public.orders FOR VALUES FROM ('2022-01-01 00:00:00') TO ('2022-04-01 00:00:00');
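
The remaining quarterly partitions follow the same pattern. For completeness, a sketch that would generate all of them, assuming this naming scheme and a psycopg3 connection (the connection string is a placeholder):

import psycopg
from datetime import date

def quarter_bounds(year: int, quarter: int) -> tuple[date, date]:
    # First day of the quarter and first day of the next quarter.
    start = date(year, 3 * (quarter - 1) + 1, 1)
    end = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
    return start, end

with psycopg.connect("dbname=mydb") as conn:  # placeholder connection string
    for year in (2022, 2023):
        for q in (1, 2, 3, 4):
            start, end = quarter_bounds(year, q)
            conn.execute(
                f"CREATE TABLE public.orders{year}_q{q} "
                f"PARTITION OF public.orders "
                f"FOR VALUES FROM ('{start}') TO ('{end}')"
            )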

1 Answer

Answered by Zegarek:

asyncpg could do 1M rows/s between Python and PostgreSQL in 2016.

These days they advertise 2M rows/s.
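
For reference, a minimal sketch of the fast path those figures refer to; the DSN and rows are placeholders, and the json columns are assumed to be passed as already-serialized strings (asyncpg's default codec maps Python str to json):

import asyncio
import asyncpg

async def load(rows):
    conn = await asyncpg.connect("postgresql://user:secret@localhost/mydb")
    try:
        # copy_records_to_table streams the records over the binary COPY
        # protocol, which is what those rows/s benchmarks measure.
        await conn.copy_records_to_table(
            "orders",
            records=rows,
            # Column order must match the tuple order in rows.
            columns=["id", "orderType", "description", "createdDate",
                     "samples", "discarded", "customerId", "originId",
                     "labTestGroups", "prices", "tax", "status"],
        )
    finally:
        await conn.close()

# Usage: asyncio.run(load(rows))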

Even if you stick with psycopg3, or the psycopg2 I'm guessing you're using now, make sure you know there are multiple ways to do that (see the executemany sketch after the list below). 14'000'000 records in 2022 works out to 38356 a day, and if loading that takes 1-4 minutes, you're loading 160-640 rows a second (half that if you meant that's how much you had for both 2022 and 2023). That rate suggests you picked the slowest, row-by-row insert method; it roughly matches the test results in the link above. Being slightly below that could mean each of your rows is heavier, but also that you're using string interpolation instead of parameter binding, slowing things down further. Other, although less likely, reasons these days would be that

  • your description, "labTestGroups", and prices values are extremely long
  • your network throughput is holding you back
  • hardware resources on either end are exceptionally slow and scarce
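
For example, switching from interpolated, row-by-row INSERTs to parameter binding with executemany already batches the round trips. A sketch, assuming psycopg3; the connection string and rows (an iterable of 12-tuples in column order) are placeholders:

import psycopg

COLS = ('id, "orderType", description, "createdDate", samples, discarded, '
        '"customerId", "originId", "labTestGroups", prices, tax, status')

with psycopg.connect("dbname=mydb") as conn:
    with conn.cursor() as cur:
        # Parameter binding plus executemany: the query text stays constant,
        # so the server can reuse its plan, and recent psycopg3 versions send
        # the batch in pipeline mode instead of one round trip per row.
        cur.executemany(
            f"INSERT INTO public.orders ({COLS}) VALUES "
            f"({', '.join(['%s'] * 12)})",
            rows,
        )
# Exiting the connection block commits the transaction.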

Loading directly into a single table will always be faster than routing records to multiple tables through the partitioning mechanism while maintaining indexes, but looking at your numbers I'd say you should be able to gain way more performance client-side.
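
If partition routing still hurts once the client side is fixed, another option is to COPY straight into the partition the day belongs to. A sketch, assuming one day's rows always fall inside a single quarter; partition_for is a hypothetical helper, and the connection string and rows are placeholders:

import psycopg
from datetime import date

def partition_for(day: date) -> str:
    # Hypothetical helper mapping a date to the quarterly naming scheme
    # shown in the question, e.g. 2022-03-15 -> orders2022_q1.
    return f"orders{day.year}_q{(day.month - 1) // 3 + 1}"

day = date(2022, 3, 15)  # placeholder; every row must fall inside this quarter
with psycopg.connect("dbname=mydb") as conn:  # placeholder connection string
    with conn.cursor() as cur:
        # COPY into the partition itself skips the routing step; PostgreSQL
        # still checks each row against the partition's range constraint.
        with cur.copy(
            f'COPY public.{partition_for(day)} '
            '(id, "orderType", description, "createdDate", samples, discarded, '
            '"customerId", "originId", "labTestGroups", prices, tax, status) '
            'FROM STDIN'
        ) as copy:
            for row in rows:  # rows: placeholder iterable of 12-tuples
                copy.write_row(row)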