pg_dump and pg_restore on giant databases

I currently have the task of improving a database structure. For this we want to dump and restore a single giant database (approx. 1 TB and growing) efficiently.

To test things with this database, we want to transfer it to another server node via pg_dump and pg_restore.

We are running a PostgreSQL 10 server (https://www.postgresql.org/docs/10/app-pgdump.html), so we are limited to the parameters available in that version. It is also required to dump the full database, not only parts of it.

For this I tried a couple of approaches, and several sources helped a lot.

The problem is that you can really only optimize one of these tasks (dump or restore), but not both at the same time.

Case 1

Dumping in directory format is extremely fast (~1 hour), but restoring is not.

pg_dump --blobs --dbname="$DBNAME" --file="$DUMPDIR" --format=directory --host=$SERVERHOSTNAME --jobs=$THREADS --port=$SERVERPORT --username="$SERVERUSERNAME"
pg_restore --clean --create --dbname=postgres --format=directory --jobs=$THREADS --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" "./"

The problem with this restore method is that even though I assigned multiple cores to it, it only uses one, with barely 4% CPU used on that server core.
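To see whether the restore actually runs in parallel, a quick diagnostic is to look at the server's sessions while it is running. This is only a sketch using the same placeholder connection variables as above; pg_restore normally reports itself via application_name, so drop that filter if nothing shows up:

psql --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" --dbname=postgres \
     -c "SELECT pid, state, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE application_name = 'pg_restore';"

If only a single row shows up during the data-loading phase, the restore is effectively serialized.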

Case 2

Dumping in custom format is so slow that the server couldn't even complete it overnight (session timeout).

pg_dump --blobs --compress=9 --dbname="$DBNAME" --file="$DUMPDIR/db.dump" --format=custom --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME"

So I had different approaches in mind:

  1. Dump it with approach #1, convert the result afterwards (how?) and use a faster restore method (variant #2?).
  2. Create multiple dumps simultaneously on different cores, one per schema (we have 6 in total), and then merge them back (how?); a rough sketch of this idea follows below.
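Regarding idea #2, here is a minimal, hypothetical sketch. The schema names and $TARGETHOST are placeholders; everything else uses the variables from above. The "merge" step is simply restoring every per-schema dump into the same target database, and this assumes the target database already exists and that cross-schema dependencies do not get in the way:

# Per-schema directory-format dumps, started in parallel (schema names are placeholders).
# Large objects are excluded when --schema is used, so they would need to be handled
# separately (e.g. add --blobs to exactly one of the dumps).
for schema in schema_a schema_b schema_c; do
    pg_dump --format=directory --schema="$schema" --file="$DUMPDIR/$schema" \
            --dbname="$DBNAME" --host=$SERVERHOSTNAME --port=$SERVERPORT \
            --username="$SERVERUSERNAME" &
done
wait

# "Merge" by restoring all schema dumps into the same (pre-created) database on the target:
for schema in schema_a schema_b schema_c; do
    pg_restore --format=directory --jobs=$THREADS --dbname="$DBNAME" \
               --host=$TARGETHOST --port=$SERVERPORT --username="$SERVERUSERNAME" \
               "$DUMPDIR/$schema"
done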

According to the author cited above, piping the dump seems to be an inefficient approach.

Does anyone have more experience with this? Are my ideas useful, or do you have a completely different solution in mind?

Oh, before I forget: we are currently limited to 5 TB on our external server, and the internal server that runs the DB should not get bloated with data fragments, even temporarily.

1 Answer

Answer by Laurenz Albe:

A parallel pg_restore with the directory format should speed up processing.

If it doesn't, I suspect that much of the data is in one large table, which pg_restore (and pg_dump) cannot parallelize.
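To check whether one table dominates, a query along these lines could be run on the source (connection parameters are the question's placeholders):

psql --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" --dbname="$DBNAME" \
     -c "SELECT relname,
                pg_size_pretty(pg_total_relation_size(oid)) AS total_size
         FROM pg_class
         WHERE relkind = 'r'
         ORDER BY pg_total_relation_size(oid) DESC
         LIMIT 10;"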

Make sure you disable compression (-Z 0) to improve the speed (unless you have a weak network).
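For example, the directory-format dump from Case 1 with compression explicitly turned off (same placeholder variables as in the question) would look roughly like this:

pg_dump --blobs --compress=0 --dbname="$DBNAME" --file="$DUMPDIR" --format=directory --jobs=$THREADS --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME"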

You might be considerably faster with an online file system backup:

  • pg_basebackup is simple, but cannot be parallelized (a sample invocation follows after this list).

  • Using the low-level API, you can parallelize the backup with operating system or storage techniques.
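A pg_basebackup invocation for the first option might look roughly like this; it needs a replication-capable connection, and $BACKUPDIR is just a placeholder for the target directory:

pg_basebackup --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" \
              --pgdata="$BACKUPDIR" --wal-method=stream --checkpoint=fast --progress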

The disadvantage is that with a file system backup, you can only copy the whole database cluster.