Transferring data between two PostgreSQL servers


I have two PostgreSQL servers, one centralized and one peripheral. The idea is for the peripheral server to accumulate data from various processes running on the same machine. Periodically (maybe hourly, or thereabouts), the data is shipped from the peripheral server to the centralized server. After confirmation of a successful transfer, the peripheral server will wipe its content to keep its implementation as lightweight as possible.

While this is initially a fairly straightforward scenario, for which I am sure there are many existing solutions, there are a few factors that limit my options:

  • The link between the peripheral and centralized servers may be down for days at a time, in which case the peripheral server will just keep all the data and retry at the next scheduled transfer, with twice as much data.

  • It is vital that no records are duplicated, so it is important that the records that are wiped match the transferred records exactly.

  • It is likely that the peripheral server will accumulate additional data while the transfer is happening, so dropping and recreating the table is infeasible, as that would purge records that haven't been transferred.


I've looked at this from a few angles, and I've concluded that I am most likely trying to reinvent the wheel here, as there are various methods that almost fit my needs. I have therefore decided to take a few steps back to see what suggestions for this scenario arise. My questions then are:

  • What's a recommended transfer method when the link is unreliable?

  • Which methods for transfer verification are there?

  • How do I ensure that the wiped data matches the transferred data exactly, when the peripheral database is still accumulating data during the transfer?

Using PostgreSQL 9.4.9 on Debian.


Rough outline of an idea:

  1. The table has a serial column that keeps track of the record ID.

  2. pg_dump the data to a file. The aforementioned serial is used in the filename, basically "this contains all records from X to Y".

  3. dumpfile is copied to centralized server via rsync or the like.

  4. Centralized server loads the data, and somehow sends a verification back to the peripheral server containing the serial. Consider repeating this a few times to ensure duplicates aren't transferred later.

  5. peripheral server nukes all data pertaining to the verified serials.

Any thoughts?

1 Answer

Accepted answer by Craig Ringer:

Your proposed approach is a reasonable starting point, but it is flawed at the very first step: you are implicitly assuming that a serial column offers a reliable ordering boundary. It does not. Consider the following sequence of events:

  1. Txn 1 starts
  2. Txn 2 starts
  3. Txn 1 gets serial value 4
  4. Txn 2 gets serial value 5
  5. Txn 2 commits
  6. Txn 3 starts
  7. Txn 3 copies out data
  8. Txn 3 notices highest committed value is 5
  9. Txn 1 commits
  10. Txn 3 deletes all data with id <= 5

Oops. Txn 3 deleted id=4, but that row was never copied. This particular case is fixed by making Txn 3 SERIALIZABLE, but if instead Txn 3 commits and some other transaction, Txn 4, does the delete, the problem comes back.

To be safe, you must add a barrier somewhere between steps 5 and 7, where you LOCK TABLE ... IN EXCLUSIVE MODE and wait for the lock to be acquired. This ensures there are no outstanding writes; everything is committed. You can now trust that no value lower than the highest value read can be committed later.
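A minimal sketch of that barrier, assuming a hypothetical table named events with a serial column id (both names are placeholders):

    -- Session A: the barrier transaction.
    BEGIN;
    -- Conflicts with ROW EXCLUSIVE, so this waits for all in-flight writers
    -- to commit or roll back, and blocks new ones.
    LOCK TABLE events IN EXCLUSIVE MODE;
    -- Record the result as x. With the lock held, no commit arriving later
    -- can introduce a row with id <= x.
    SELECT max(id) AS x FROM events;
    -- Keep this transaction (and its lock) open for the next step.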

From a different (newly started) xact, you then call pg_export_snapshot().

Once the snapshot is exported, you can rollback the xact that took the lock and let writes continue.

Leave the xact that exported the snapshot open. Pass the snapshot ID to pg_dump with --snapshot, so it dumps exactly from the point in virtual time where you know the highest committed ID. You must keep the xact holding the snapshot open until the dump completes, but writes can continue in the meantime.
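Continuing the sketch with the same placeholder names (the snapshot ID is whatever pg_export_snapshot() returns, and your pg_dump must be new enough to support --snapshot):

    -- Session B: a fresh connection, opened while session A still holds the lock.
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();   -- note the snapshot ID it returns
    -- Session A can now ROLLBACK; writes resume, but session B's exported
    -- snapshot stays frozen at a point where everything with id <= x, and
    -- nothing else, is committed.
    -- Keep session B's transaction open while the dump runs, e.g. from a shell:
    --   pg_dump --snapshot=<id from above> --table=events -Fc -f events.dump mydb
    -- Only once pg_dump has finished may session B commit or roll back.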

You can now safely DELETE FROM ... WHERE id <= x for the value of x you looked up with the lock held. You know, due to the exported snapshot, that you didn't dump anything with a higher ID. And you know, due to the lock, that you won't miss anything that commits later with a lower ID.
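In the sketch's terms, once the centralized server has confirmed a successful load:

    -- Back on the peripheral server, in an ordinary transaction;
    -- 42 stands in for the x recorded while the lock was held.
    DELETE FROM events WHERE id <= 42;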


Seem complicated?

Instead, I suggest using PostgreSQL's logical decoding features. The tool pglogical, which uses logical decoding, has features that make what you want to do relatively easy; it gets rid of the ordering issues and guarantees that rows are replicated exactly once. It's easy to set it to replicate inserts, but not deletes, so you can just DELETE FROM mytable every now and then.
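To give a rough idea of the shape of that setup (only a sketch based on pglogical 2.x conventions; node names, DSNs, and the table are placeholders, and the pglogical documentation is the authority on the exact function arguments):

    -- Assumes CREATE EXTENSION pglogical has been run on both servers.

    -- On the peripheral server (the provider):
    SELECT pglogical.create_node(
        node_name := 'peripheral',
        dsn := 'host=peripheral-host dbname=mydb'
    );
    -- A replication set that replicates inserts but not updates, deletes or truncates.
    SELECT pglogical.create_replication_set(
        set_name := 'inserts_only',
        replicate_insert := true,
        replicate_update := false,
        replicate_delete := false,
        replicate_truncate := false
    );
    SELECT pglogical.replication_set_add_table('inserts_only', 'public.events');

    -- On the centralized server (the subscriber):
    SELECT pglogical.create_node(
        node_name := 'central',
        dsn := 'host=central-host dbname=mydb'
    );
    SELECT pglogical.create_subscription(
        subscription_name := 'from_peripheral',
        provider_dsn := 'host=peripheral-host dbname=mydb',
        replication_sets := ARRAY['inserts_only']
    );

    -- The peripheral side can then DELETE FROM events periodically without
    -- those deletes being replicated to the central server.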

(Of course, I'm part of the dev team for pglogical and its related tool BDR, so I'm biased).

The main issue you'll likely face is that you'll have to retain more data because you'll need to retain the WAL, not just the table contents. (Technically you can actually DELETE the data as soon as it's INSERTed and it'll replicate fine, but you probably won't want to do that). So if disk space is an issue, you might want to favour the dump approach. But there are some big improvements coming in pglogical3 that will make this issue go away.