We are experiencing a problem configuring PostgreSQL for HA using Corosync and Pacemaker.
The crm_mon output is:
Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local
Stack: corosync
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
4 Resources configured
Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
vip-rep (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
Node Attributes:
* Node umhtvappdpj05.arqiva.local:
+ master-pgsql : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000000097000168
* Node umhtvappdpj06.arqiva.local:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000094000090
+ pgsql-status : PRI
Migration summary:
* Node umhtvappdpj05.arqiva.local:
* Node umhtvappdpj06.arqiva.local:
Here node 06 (umhtvappdpj06.arqiva.local) started as the primary and node 05 (umhtvappdpj05.arqiva.local) is acting as the standby, but the two are not connected: node 05 reports pgsql-status HS:alone rather than HS:sync, and its master score (master-pgsql) is -INFINITY.
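One quick way to confirm what the attributes already suggest is to query pg_stat_replication on the current primary; a connected standby would show up there with state 'streaming'. This is purely a diagnostic check, using the psql path from the resource definition below:
# on node 06 (the current primary), as the postgres user
/usr/pgsql-9.3/bin/psql -x -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
# an empty result confirms that no standby is connected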
recovery.conf on node 05
standby_mode = 'on'
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
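Since both this restore_command and the one passed to the pgsql resource use scp, the postgres user on node 05 must be able to reach 10.52.6.85 over SSH without a password, or archive recovery cannot work. A quick check (only the host and archive path from the question are assumed):
# on node 05, as the postgres user
ssh -o BatchMode=yes 10.52.6.85 'ls /var/lib/pgsql/pg_archive | head'
# a password prompt or an error here means restore_command will fail as well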
The resources were created as follows:
pcs resource create vip-master IPaddr2 \
ip="10.52.6.94" \
nic="ens192" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
pcs resource create vip-rep IPaddr2 \
ip="10.52.6.95" \
nic="ens192" \
cidr_netmask="24" \
meta migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
pcs resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/pgsql-9.3/bin/pg_ctl" \
psql="/usr/pgsql-9.3/bin/psql" \
pgdata="/pgdata/data" \
rep_mode="sync" \
node_list="10.52.6.85 10.52.6.92" \
restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \
master_ip="10.52.6.95" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
restart_on_promote='true' \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
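The crm_mon output above also shows a master/slave set (msPostgresql) and the master-group containing the two VIPs, but the commands that created them are not listed. For reference, a typical setup for this agent looks roughly like the following; treat it as a sketch, since the exact scores and options in the actual cluster may differ:
pcs resource master msPostgresql pgsql \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource group add master-group vip-master vip-rep
pcs constraint colocation add master-group with Master msPostgresql INFINITY
pcs constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs constraint order demote msPostgresql then stop master-group symmetrical=false score=0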
[root@umhtvappdpj05 data]# pcs resource show --all
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started
vip-rep (ocf::heartbeat:IPaddr2): Started
[root@umhtvappdpj05 data]#
The only anomaly is that Corosync and Pacemaker were first installed on node 06 while it was on a different subnet from node 05; node 06 was later moved onto the same subnet as node 05. Could this be the cause? Reinstalling on node 06 does seem to make sense.
Thank you,
Sameer
Here is how you can connect your replica to the primary PostgreSQL server. These steps always work for me; let me know if they don't.
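In outline, assuming the data directory, replication VIP and PostgreSQL binaries from the question (and the pgsql agent's default lock-file location under /var/lib/pgsql/tmp):
# on node 05: take the node out of service so Pacemaker stops its resources
pcs cluster standby umhtvappdpj05.arqiva.local
# rebuild the standby's data directory from the current primary
su - postgres
rm -rf /pgdata/data
/usr/pgsql-9.3/bin/pg_basebackup -h 10.52.6.95 -p 5432 -U postgres -D /pgdata/data -X stream -P
rm -f /var/lib/pgsql/tmp/PGSQL.lock    # stale lock file left by the pgsql agent, if present
exit
# bring the node back and clear old failures so the resource is restarted
pcs cluster unstandby umhtvappdpj05.arqiva.local
pcs resource cleanup msPostgresql
Because rep_mode is set, the pgsql agent should regenerate recovery.conf itself when it starts the standby, so it does not need to be recreated by hand.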