I got this error (on the replica) while replicating between two Postgres instances:
ERROR: connection to other side has died
Here are the logs on the replica/subscriber:
2017-09-15 20:03:55 UTC [14335-3] LOG: apply worker [14335] at slot 7 generation 109 crashed
2017-09-15 20:03:55 UTC [2961-1732] LOG: worker process: pglogical apply 16384:3661733826 (PID 14335) exited with exit code 1
2017-09-15 20:03:59 UTC [14331-2] ERROR: connection to other side has died
2017-09-15 20:03:59 UTC [14331-3] LOG: apply worker [14331] at slot 2 generation 132 crashed
2017-09-15 20:03:59 UTC [2961-1733] LOG: worker process: pglogical apply 16384:3423246629 (PID 14331) exited with exit code 1
2017-09-15 20:04:02 UTC [14332-2] ERROR: connection to other side has died
2017-09-15 20:04:02 UTC [14332-3] LOG: apply worker [14332] at slot 4 generation 125 crashed
2017-09-15 20:04:02 UTC [2961-1734] LOG: worker process: pglogical apply 16384:2660030132 (PID 14332) exited with exit code 1
2017-09-15 20:04:02 UTC [14350-1] LOG: starting apply for subscription parking_sub
2017-09-15 20:04:05 UTC [14334-2] ERROR: connection to other side has died
2017-09-15 20:04:05 UTC [14334-3] LOG: apply worker [14334] at slot 6 generation 119 crashed
2017-09-15 20:04:05 UTC [2961-1735] LOG: worker process: pglogical apply 16384:394989729 (PID 14334) exited with exit code 1
2017-09-15 20:04:06 UTC [14333-2] ERROR: connection to other side has died
Logs on master/provider:
2017-09-15 23:22:43 UTC [22068-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:43 UTC [22068-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:44 UTC [22067-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:44 UTC [22067-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:48 UTC [22070-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:48 UTC [22070-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:49 UTC [22069-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:49 UTC [22069-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
Config on master/provider:
archive_mode = on
archive_command = 'cp %p /data/pgdata/wal_archives/%f'
max_wal_senders = 20
wal_level = logical
max_worker_processes = 100
max_replication_slots = 100
shared_preload_libraries = pglogical
max_wal_size = 20GB
Config on the replica/subscriber:
max_replication_slots = 100
shared_preload_libraries = pglogical
max_worker_processes = 100
max_wal_size = 20GB
I have a total of 18 subscriptions for 18 schemas. It seemed to work fine in the beginning, but it quickly deteriorated, and some subscriptions started to bounce between the down and replicating statuses, with the error posted above.
Question
What could be the possible causes? Do I need to change my Pg configurations?
Also, I noticed that when replication is going on, the CPU usage on the master/provider is pretty high.
/# ps aux | sort -nrk 3,3 | head -n 5
postgres 18180 86.4 1.0 415168 162460 ? Rs 22:32 19:03 postgres: getaround getaround 10.240.0.7(64106) CREATE INDEX
postgres 20349 37.0 0.2 339428 38452 ? Rs 22:53 0:07 postgres: wal sender process repuser 10.240.0.7(49742) idle
postgres 20351 33.8 0.2 339296 36628 ? Rs 22:53 0:06 postgres: wal sender process repuser 10.240.0.7(49746) idle
postgres 20350 28.8 0.2 339016 44024 ? Rs 22:53 0:05 postgres: wal sender process repuser 10.240.0.7(49744) idle
postgres 20352 27.6 0.2 339420 36632 ? Rs 22:53 0:04 postgres: wal sender process repuser 10.240.0.7(49750) idle
Thanks in advance!
I had a similar problem, which was fixed by setting wal_sender_timeout on the master/provider to 5 minutes (the default is 1 minute). The master/provider drops a replication connection that has been inactive for longer than this timeout, so raising it seems to have fixed the problem for me.
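A minimal sketch of the wal_sender_timeout change, assuming it is made in postgresql.conf on the master/provider next to the settings listed in the question. The 5min value is only what worked in this case, not a general recommendation:

```
# postgresql.conf on the master/provider
# Assumption: 5min is the value that helped here; tune it for your workload.
wal_sender_timeout = 5min    # default is 60s; 0 disables the timeout
```

As far as I know, this setting is picked up on a configuration reload, so a full restart should not be needed.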