I have the Crunchy Postgres Operator running on a bare-metal Kubernetes cluster with 3 worker nodes, deployed using Kubespray. I set up one replica to take over when the primary goes down. The replica was running and in sync with the Postgres primary, with no lag. As a test, I stopped the node on which the primary Postgres instance was running. The failover to the replica happened, and Postgres became available again after a moment.
When I started the stopped node again, the Postgres instance on it crashed and its lag is now reported as unknown:
Every 2.0s: patronictl list
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
| Member                    | Host                                    | Role    | State   | TL | Lag in MB |
+ Cluster: pg-metal-ha (7075323376834977860) -------------------------+---------+---------+----+-----------+
| pg-metal-instance1-hfdp-0 | pg-metal-instance1-hfdp-0.pg-metal-pods | Replica | running |    |   unknown |
| pg-metal-instance1-zdc6-0 | pg-metal-instance1-zdc6-0.pg-metal-pods | Leader  | running |  2 |           |
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
The log of the crashed instance's pod shows:
psycopg2.OperationalError: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
The hint did not help: I can't reindex the index "pg_database_oid_index" using psql, because psql itself fails with the same error. This is the output of the psql command:
bash-4.4$ psql
psql: error: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
I repeated the failover test many times with newly created Postgres clusters and got the same result each time. Is this a bug in crunchy-postgres-operator?
Kubernetes version:
# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
postgres.yaml:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg-metal
  namespace: prj-metal
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:centos8-13.6-3.0-0
  postgresVersion: 13
  users:
    - name: pg
      options: "SUPERUSER"
  instances:
    - name: instance1
      replicas: 2
      dataVolumeClaimSpec:
        storageClassName: "ins-ls"
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 75Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: pg-metal
                postgres-operator.crunchydata.com/instance-set: instance1