Solutions for restarting Postgres when pg_wal full

2.6k views Asked by At

This question is a follow up on a previous question I posted about efficient ways to max up Postgres pg_wal directory (for training purposes) : How to bloat pg_wal?.

I'm now wondering what are the possible solutions for getting a Postgres server up and running once the partition is full because pg_wal has filled up.

I'm interesting in solutions that do not involve adding extra disk space to the partition.

Here are a the 2 other solutions I've come accross while discussing with colleagues :

  1. Move the entire pg_wal directory to another partition with sufficient available disk space and point to this new location from $PGDATA (ok this is kind of an additional disk space solution)
  2. Regain disk space by removing all WALs that have already been archived (a file with the same name and suffixed by .done should be present in pg_wal/archive_status)

Using pgBackRest I ran the archive-push command manually and then removed the WALs from pg_wal directory but I got the following error when starting Postgres :

2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=LOG:  invalid primary checkpoint record
2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=PANIC:  could not locate a valid checkpoint record

Obviously I've removed too many files, but I'm wondering if a clean solution based on the same idea could be used.

Question

Is there a way to clean up pg_wal without having to restart Postgres once it has shutdown because pg_wal has filled up ?

2

There are 2 answers

2
Laurenz Albe On BEST ANSWER

I'm interesting in solutions that do not involve adding extra disk space to the partition.

Hmm, then you are excluding the proper solution, which is exactly to increase the disk space on the WAL file system.

Your first solution (move pg_wal and put a symbolic link into the data directory) is entirely feasible. But as you say, that requires additional disk space, so why not extend the actual WAL file system instead?

Your second idea is not commendable. True, WAL segments that are marked as .done can be removed, but PostgreSQL will do that automatically at the next checkpoint anyway, so there should not be many of those lying around at any given time. And as you noticed, manually messing with the data directory is not a good idea; the danger of breaking your database is just too high.

Hands off from pg_resetwal. This executable, when run on a crashed data directory, will remove WAL and put PostgreSQL into a state where you can start it, but it will cause data corruption. pg_resetwal is intended as a desperate measure to get a corrupted server to start so you can salvage some of the data.

1
jjanes On

You probably don't need very much space to get it running again, so you should only need to remove a handful of files with corresponding '.done' entries, not all of them. Once the system is running it should go through and clean things up the rest of the way on its own. If you had removed just the oldest handful, it might not have caused a problem in the first place. But given that you have already removed too many, you should be able to manually copy the important ones back again from the archive to pg_wal.

You could instead delete (really, copy someplace else first, then delete) any recycled ahead WAL files, which were already archived and then were renamed to a future name in anticipation of reuse. The problem is how to identify them with high confidence on a crashed instance.

What I do is keep a file with a couple gig of random gibberish (to defeat transparent compression, if any is in use) as "ballast" someplace in the same partition as the data directory. Then if I run into such an emergency, I just delete the ballast file, then try to remember to recreate it later.