Postgres Replication with pglogical: ERROR: connection to other side has died

I got this error (on the replica) while replicating between two Postgres instances with pglogical:

ERROR: connection to other side has died

Here are the logs on the replica/subscriber:

    2017-09-15 20:03:55 UTC [14335-3] LOG:  apply worker [14335] at slot 7 generation 109 crashed
    2017-09-15 20:03:55 UTC [2961-1732] LOG:  worker process: pglogical apply 16384:3661733826 (PID 14335) exited with exit code 1
    2017-09-15 20:03:59 UTC [14331-2] ERROR:  connection to other side has died
    2017-09-15 20:03:59 UTC [14331-3] LOG:  apply worker [14331] at slot 2 generation 132 crashed
    2017-09-15 20:03:59 UTC [2961-1733] LOG:  worker process: pglogical apply 16384:3423246629 (PID 14331) exited with exit code 1
    2017-09-15 20:04:02 UTC [14332-2] ERROR:  connection to other side has died
    2017-09-15 20:04:02 UTC [14332-3] LOG:  apply worker [14332] at slot 4 generation 125 crashed
    2017-09-15 20:04:02 UTC [2961-1734] LOG:  worker process: pglogical apply 16384:2660030132 (PID 14332) exited with exit code 1
    2017-09-15 20:04:02 UTC [14350-1] LOG:  starting apply for subscription parking_sub
    2017-09-15 20:04:05 UTC [14334-2] ERROR:  connection to other side has died
    2017-09-15 20:04:05 UTC [14334-3] LOG:  apply worker [14334] at slot 6 generation 119 crashed
    2017-09-15 20:04:05 UTC [2961-1735] LOG:  worker process: pglogical apply 16384:394989729 (PID 14334) exited with exit code 1
    2017-09-15 20:04:06 UTC [14333-2] ERROR:  connection to other side has died

Logs on master/provider:

    2017-09-15 23:22:43 UTC [22068-5] repuser@ga-master ERROR:  got sequence entry 1 for toast chunk 1703536315 instead of seq 0
    2017-09-15 23:22:43 UTC [22068-6] repuser@ga-master LOG:  could not receive data from client: Connection reset by peer
    2017-09-15 23:22:44 UTC [22067-5] repuser@ga-master ERROR:  got sequence entry 1 for toast chunk 1703536315 instead of seq 0
    2017-09-15 23:22:44 UTC [22067-6] repuser@ga-master LOG:  could not receive data from client: Connection reset by peer
    2017-09-15 23:22:48 UTC [22070-5] repuser@ga-master ERROR:  got sequence entry 1 for toast chunk 1703536315 instead of seq 0
    2017-09-15 23:22:48 UTC [22070-6] repuser@ga-master LOG:  could not receive data from client: Connection reset by peer
    2017-09-15 23:22:49 UTC [22069-5] repuser@ga-master ERROR:  got sequence entry 1 for toast chunk 1703536315 instead of seq 0
    2017-09-15 23:22:49 UTC [22069-6] repuser@ga-master LOG:  could not receive data from client: Connection reset by peer

Config on master/provider:

    archive_mode = on
    archive_command = 'cp %p /data/pgdata/wal_archives/%f'
    max_wal_senders = 20
    wal_level = logical
    max_worker_processes = 100
    max_replication_slots = 100
    shared_preload_libraries = pglogical
    max_wal_size = 20GB

Config on the replica/subscriber:

    max_replication_slots = 100
    shared_preload_libraries = pglogical
    max_worker_processes = 100
    max_wal_size = 20GB

I have a total of 18 subscriptions, one for each of 18 schemas. Replication seemed to work fine at first, but it quickly deteriorated: some subscriptions started to bounce between the down and replicating statuses, with the error posted above.

Question

What are the possible causes? Do I need to change my Postgres configuration?

Also, I noticed that when replication is going on, the CPU usage on the master/provider is pretty high.

    /# ps aux | sort -nrk 3,3 | head -n 5
    postgres 18180 86.4  1.0 415168 162460 ?       Rs   22:32  19:03 postgres: getaround getaround 10.240.0.7(64106) CREATE INDEX
    postgres 20349 37.0  0.2 339428 38452 ?        Rs   22:53   0:07 postgres: wal sender process repuser 10.240.0.7(49742) idle
    postgres 20351 33.8  0.2 339296 36628 ?        Rs   22:53   0:06 postgres: wal sender process repuser 10.240.0.7(49746) idle
    postgres 20350 28.8  0.2 339016 44024 ?        Rs   22:53   0:05 postgres: wal sender process repuser 10.240.0.7(49744) idle
    postgres 20352 27.6  0.2 339420 36632 ?        Rs   22:53   0:04 postgres: wal sender process repuser 10.240.0.7(49750) idle

Thanks in advance!

1 Answer

Answered by Dean Povey

I had a similar problem, which was fixed by setting wal_sender_timeout on the master/provider to 5 minutes (the default is 1 minute). The provider drops a replication connection once it has been inactive for longer than this timeout, so raising the value gives the subscriber more time to respond. This seems to have fixed the problem for me.
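A minimal sketch of that change, assuming superuser access on the provider and PostgreSQL 9.4 or later (needed for ALTER SYSTEM). The 5min value is simply the one that worked in my case, not a tuned recommendation:

    -- Run on the master/provider as a superuser.
    ALTER SYSTEM SET wal_sender_timeout = '5min';
    -- wal_sender_timeout only needs a configuration reload, not a restart.
    SELECT pg_reload_conf();
    -- Confirm the new value (it may take a moment to take effect).
    SHOW wal_sender_timeout;

    -- Run on the replica/subscriber: check whether the subscriptions
    -- now stay in the "replicating" status instead of bouncing to "down".
    SELECT subscription_name, status
    FROM pglogical.show_subscription_status();

Alternatively, adding wal_sender_timeout = 5min to postgresql.conf on the provider and reloading the server should have the same effect.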