Postgres replica crashes after a successful failover


I have the Crunchy Postgres Operator running on a Kubernetes cluster with 3 worker nodes, deployed with Kubespray on bare metal. I configured one replica to take over when the primary goes down. The replica was in the running state and in sync with the Postgres primary with no lag. As a test, I stopped the node that the primary Postgres instance was running on; the failover to the replica completed, and Postgres became available again after a moment.

When I restart the stopped node, the Postgres instance on it crashes and its lag shows as unknown:

Every 2.0s: patronictl list                                                                                                                                                        
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
| Member                    | Host                                    | Role    | State   | TL | Lag in MB |
+ Cluster: pg-metal-ha (7075323376834977860) -------------------------+---------+---------+----+-----------+
| pg-metal-instance1-hfdp-0 | pg-metal-instance1-hfdp-0.pg-metal-pods | Replica | running |    |   unknown |
| pg-metal-instance1-zdc6-0 | pg-metal-instance1-zdc6-0.pg-metal-pods | Leader  | running |  2 |           |
+---------------------------+-----------------------------------------+---------+---------+----+-----------+

the log of the crashed instance pod is:

psycopg2.OperationalError: FATAL:  index "pg_database_oid_index" contains unexpected zero page at block 0
HINT:  Please REINDEX it.

The hint didn't work: I can't reindex "pg_database_oid_index" via psql, because psql itself fails to connect. This is the output of the psql command:

bash-4.4$ psql
psql: error: FATAL:  index "pg_database_oid_index" contains unexpected zero page at block 0
HINT:  Please REINDEX it.
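Since psql can't connect at all, the suggested REINDEX can't be run the normal way. One workaround sketch is to start Postgres in single-user mode inside the crashed pod, with system indexes ignored, and reindex from there. The data directory path below is an assumption (check `$PGDATA` in your pod), and the pod name is taken from the `patronictl list` output above:

```shell
# Open a shell in the crashed instance pod (name/namespace from this cluster).
kubectl exec -it -n prj-metal pg-metal-instance1-hfdp-0 -c database -- bash

# Inside the pod: start the backend in single-user mode.
#   --single : single-user mode, reads SQL from stdin
#   -P       : ignore system indexes, so the corrupt index doesn't block startup
#   -D ...   : data directory -- /pgdata/pg13 is an ASSUMED path, verify with `echo $PGDATA`
postgres --single -P -D /pgdata/pg13 postgres <<'EOF'
REINDEX SYSTEM postgres
EOF
```

Note that if the underlying storage is actually corrupt, reindexing may only paper over the problem (or fail again), so this is a diagnostic step rather than a guaranteed fix.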

I redid the failover test many times with newly created Postgres clusters and got the same result each time. Is this a bug in crunchy-postgres-operator?
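In a Patroni-managed cluster like this one, an alternative to repairing the old primary's data directory in place is to throw it away and re-copy it from the new leader with `patronictl reinit`. This is a sketch using the cluster and member names from the `patronictl list` output above:

```shell
# Re-initialize the crashed member: this DESTROYS its local data directory
# and re-clones it from the current leader (pg-metal-instance1-zdc6-0).
# Run from inside any healthy instance pod where patronictl is configured.
patronictl reinit pg-metal-ha pg-metal-instance1-hfdp-0
```

This doesn't explain the corruption itself. Unexpected zero pages after a hard node stop often point at the storage layer (e.g. local volumes or caches not honoring fsync/flush on power loss), so it may be worth verifying the durability guarantees of the "ins-ls" storage class before ruling the operator in or out as the cause.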

k8s version:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

postgres.yaml :

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg-metal
  namespace: prj-metal

spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:centos8-13.6-3.0-0
  postgresVersion: 13
  users:
    - name: pg
      options: "SUPERUSER"
  instances:
    - name: instance1
      replicas: 2
      dataVolumeClaimSpec:
        storageClassName: "ins-ls"
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 75Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: pg-metal
                postgres-operator.crunchydata.com/instance-set: instance1
