Configuring PostgreSQL HA on RHEL 7.0


We are experiencing a problem configuring PostgreSQL for HA using Corosync and Pacemaker.

The output of crm_mon is:

Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local 
Stack: corosync 
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum 
Version: 1.1.10-29.el7-368c726 
2 Nodes configured 
4 Resources configured 


Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ] 

Full list of resources: 

 Master/Slave Set: msPostgresql [pgsql] 
     Masters: [ umhtvappdpj06.arqiva.local ] 
     Slaves: [ umhtvappdpj05.arqiva.local ] 
 Resource Group: master-group 
     vip-master (ocf::heartbeat:IPaddr2):       Started umhtvappdpj06.arqiva.local 
     vip-rep    (ocf::heartbeat:IPaddr2):       Started umhtvappdpj06.arqiva.local 

Node Attributes: 
* Node umhtvappdpj05.arqiva.local: 
    + master-pgsql                      : -INFINITY 
    + pgsql-data-status                 : LATEST 
    + pgsql-status                      : HS:alone 
    + pgsql-xlog-loc                    : 0000000097000168 
* Node umhtvappdpj06.arqiva.local: 
    + master-pgsql                      : 1000 
    + pgsql-data-status                 : LATEST 
    + pgsql-master-baseline             : 0000000094000090 
    + pgsql-status                      : PRI 

Migration summary: 
* Node umhtvappdpj05.arqiva.local: 
* Node umhtvappdpj06.arqiva.local:

Here node 06 (umhtvappdpj06.arqiva.local) started as the primary and node 05 (umhtvappdpj05.arqiva.local) acts as the standby, but the two are not connected: the standby's pgsql-status is HS:alone and its master-pgsql score is -INFINITY, meaning it is running but not streaming from the primary.
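
A quick way to confirm this from the primary side (my addition, not in the original post) is to query pg_stat_replication on node 06; an empty result means no standby has ever connected:

# Run on the current primary (node 06); an empty result confirms
# the standby is not streaming. Binary path taken from the config below.
/usr/pgsql-9.3/bin/psql -U postgres -p 5432 \
    -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"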

The recovery.conf on node 05:
standby_mode = 'on' 
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5' 
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p' 
recovery_target_timeline = 'latest'
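
Since primary_conninfo points at the replication VIP, it is also worth verifying from the standby that both the VIP and the restore_command path are reachable at all. A minimal check of my own (assuming the postgres user can authenticate and key-based SSH is set up):

# From node 05: confirm the replication VIP accepts connections.
/usr/pgsql-9.3/bin/psql -h 10.52.6.95 -p 5432 -U postgres -c "SELECT 1;"
# Confirm the archive directory used by restore_command is reachable.
ssh 10.52.6.85 ls /var/lib/pgsql/pg_archive | head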

The resources were created as follows:

pcs resource create vip-master IPaddr2 \ 
ip="10.52.6.94" \ 
nic="ens192" \ 
cidr_netmask="24" \ 
op start   timeout="60s" interval="0s"  on-fail="restart" \ 
op monitor timeout="60s" interval="10s" on-fail="restart" \ 
op stop    timeout="60s" interval="0s"  on-fail="block" 

pcs resource create vip-rep IPaddr2 \ 
ip="10.52.6.95" \ 
nic="ens192" \ 
cidr_netmask="24" \ 
meta migration-threshold="0" \ 
op start   timeout="60s" interval="0s"  on-fail="stop" \ 
op monitor timeout="60s" interval="10s" on-fail="restart" \ 
op stop    timeout="60s" interval="0s"  on-fail="ignore" 

pcs resource create pgsql ocf:heartbeat:pgsql \ 
pgctl="/usr/pgsql-9.3/bin/pg_ctl" \ 
psql="/usr/pgsql-9.3/bin/psql" \ 
pgdata="/pgdata/data" \ 
rep_mode="sync" \ 
node_list="10.52.6.85 10.52.6.92" \ 
restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \ 
master_ip="10.52.6.95" \ 
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \ 
restart_on_promote='true' \ 
op start   timeout="60s" interval="0s"  on-fail="restart" \ 
op monitor timeout="60s" interval="10s" on-fail="restart" \ 
op monitor timeout="60s" interval="9s"  on-fail="restart" role="Master" \ 
op promote timeout="60s" interval="0s"  on-fail="restart" \ 
op demote  timeout="60s" interval="0s"  on-fail="stop" \ 
op stop    timeout="60s" interval="0s"  on-fail="block" \ 
op notify  timeout="60s" interval="0s" 
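
The Master/Slave set (msPostgresql) and the resource group (master-group) that appear in crm_mon are not shown being created above. Presumably commands along these lines were used (a sketch based on the typical pgsql resource-agent setup; the scores and options are my assumptions, not taken from the post):

pcs resource master msPostgresql pgsql \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs resource group add master-group vip-master vip-rep

pcs constraint colocation add master-group with master msPostgresql INFINITY
pcs constraint order promote msPostgresql then start master-group symmetrical=false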

[root@umhtvappdpj05 data]# pcs resource show --all 
 Master/Slave Set: msPostgresql [pgsql] 
     Masters: [ umhtvappdpj06.arqiva.local ] 
     Slaves: [ umhtvappdpj05.arqiva.local ] 
 Resource Group: master-group 
     vip-master (ocf::heartbeat:IPaddr2):       Started 
     vip-rep    (ocf::heartbeat:IPaddr2):       Started 
[root@umhtvappdpj05 data]#

The only anomaly is that Corosync and Pacemaker were first installed on node 06 while it was on a different subnet from node 05; node 06 was later moved onto the same subnet as node 05. Could this be the cause? Would reinstalling on node 06 make sense?
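
If the subnet move is suspected, checking what Corosync currently believes about the ring is cheaper than reinstalling. These are standard corosync/pcs diagnostics, not from the original post:

corosync-cfgtool -s               # ring status and bound address on each node
corosync-cmapctl | grep members   # current membership as corosync sees it
pcs status corosync               # membership summary via pcs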

Thank you,

Sameer

1 Answer

Ajay Pratap

Here is how you can reconnect your replica to the primary PostgreSQL server.

1. Touch the PGSQL.lock file in /var/lib/pgsql/tmp/ on umhtvappdpj05.arqiva.local
2. Stop PostgreSQL on umhtvappdpj05.arqiva.local using systemctl
3. Take a base backup (or rsync) of the data directory from the primary server to the replica
4. Remove the PGSQL.lock file from the replica
5. Run pcs resource cleanup pgsql

These steps always work for me; a shell sketch of the procedure follows. Let me know if they don't.
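
A minimal sketch of the steps above, run on the replica (node 05). The data directory, primary address, and service name are assumed from the question and the PGDG 9.3 packaging; adjust them to your layout, and note that passwordless access for the postgres user is assumed:

# 1. Create the lock file so the pgsql resource agent will not start
#    PostgreSQL on the replica while we rebuild it.
touch /var/lib/pgsql/tmp/PGSQL.lock

# 2. Stop PostgreSQL on the replica (unit name assumed for PGDG 9.3).
systemctl stop postgresql-9.3

# 3. Re-seed the replica's data directory from the primary
#    (pg_basebackup requires an empty target directory).
rm -rf /pgdata/data/*
/usr/pgsql-9.3/bin/pg_basebackup -h 10.52.6.85 -U postgres \
    -D /pgdata/data -X stream -P
chown -R postgres:postgres /pgdata/data

# 4. Remove the lock file so the resource agent may start PostgreSQL again.
rm -f /var/lib/pgsql/tmp/PGSQL.lock

# 5. Clear the failure history so Pacemaker retries the resource.
pcs resource cleanup pgsql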