Pods go back and forth between state Running and state CrashLoopBackOff


Cluster information:

Kubernetes version:
    root@k8s-eu-1-master:~# kubectl version
    Client Version: v1.28.2
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.28.2

Cloud being used: Contabo Cloud (bare-metal)

Installation method: followed these steps: https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?utm_content=cmp-true

Host OS: Ubuntu 22.04

CNI and version:

root@k8s-eu-1-master:~# ls /etc/cni/net.d/
10-flannel.conflist


root@k8s-eu-1-master:~# cat /etc/cni/net.d/10-flannel.conflist 
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

CRI and version:

Container Runtime : containerd 

root@k8s-eu-1-master:~# cat /etc/containerd/config.toml  | grep version
version = 2

The pods go back and forth between the Running and CrashLoopBackOff states:

root@k8s-eu-1-master:~# kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS       AGE
coredns-5dd5756b68-g2bkc                  1/1     Running   0              2d4h
coredns-5dd5756b68-gt7xt                  1/1     Running   0              2d4h
etcd-k8s-eu-1-master                      1/1     Running   1 (2d2h ago)   2d4h
kube-apiserver-k8s-eu-1-master            1/1     Running   1 (2d2h ago)   2d4h
kube-controller-manager-k8s-eu-1-master   1/1     Running   1 (2d2h ago)   2d4h
kube-proxy-7mj86                          1/1     Running   1 (2d2h ago)   2d4h
kube-proxy-7nvv5                          1/1     Running   1 (2d2h ago)   2d3h
kube-proxy-fq6vz                          1/1     Running   1 (2d2h ago)   2d4h
kube-proxy-n2nm5                          1/1     Running   1 (2d2h ago)   2d3h
kube-proxy-qhvrn                          1/1     Running   1 (2d2h ago)   2d4h
kube-proxy-tbrn4                          1/1     Running   1 (2d2h ago)   2d3h
kube-scheduler-k8s-eu-1-master            1/1     Running   1 (2d2h ago)   2d4h

root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS             RESTARTS       AGE
arango-deployment-operator-7f59876f78-7djdr   0/1     CrashLoopBackOff   87 (11s ago)   4h58m
arango-storage-operator-6c7fdf5586-gjcrp      0/1     CrashLoopBackOff   83 (98s ago)   4h44m


root@k8s-eu-1-master:~# kubectl describe pod arango-deployment-operator-7f59876f78-7djdr
Name:             arango-deployment-operator-7f59876f78-7djdr
Namespace:        default
Priority:         0
Service Account:  arango-deployment-operator
Node:             k8s-eu-1-worker-2/xx.xxx.xxx.xxx
Start Time:       Thu, 19 Oct 2023 12:56:41 +0200
Labels:           app.kubernetes.io/instance=deployment
                  app.kubernetes.io/managed-by=Tiller
                  app.kubernetes.io/name=kube-arangodb
                  helm.sh/chart=kube-arangodb-1.2.34
                  pod-template-hash=7f59876f78
                  release=deployment
Annotations:      <none>
Status:           Running
IP:               10.244.0.6
IPs:
  IP:           10.244.0.6
Controlled By:  ReplicaSet/arango-deployment-operator-7f59876f78
Containers:
  operator:
    Container ID:  containerd://344e2967054112557a9333332f99a8ca1dc3312285c808c727de6468f8c73381
    Image:         arangodb/kube-arangodb:1.2.34
    Image ID:      docker.io/arangodb/kube-arangodb@sha256:a25d031e87ba5b0f3038ce9f346553b69760a3a065fe608727cde188602b59e8
    Port:          8528/TCP
    Host Port:     0/TCP
    Args:
      --scope=legacy
      --operator.deployment
      --mode.single
      --chaos.allowed=false
      --log.level=debug
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 19 Oct 2023 17:39:23 +0200
      Finished:     Thu, 19 Oct 2023 17:40:22 +0200
    Ready:          False
    Restart Count:  83
    Liveness:       http-get https://:8528/health delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:8528/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      MY_POD_NAMESPACE:  default (v1:metadata.namespace)
      MY_POD_NAME:       arango-deployment-operator-7f59876f78-7djdr (v1:metadata.name)
      MY_POD_IP:          (v1:status.podIP)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g4fbd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-g4fbd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 5s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  48m (x215 over 4h48m)    kubelet  Liveness probe failed: Get "https://10.244.0.6:8528/health": dial tcp 10.244.0.6:8528: connect: connection refused
  Normal   Pulling    28m (x77 over 4h48m)     kubelet  Pulling image "arangodb/kube-arangodb:1.2.34"
  Warning  Unhealthy  13m (x565 over 4h48m)    kubelet  Readiness probe failed: Get "https://10.244.0.6:8528/ready": dial tcp 10.244.0.6:8528: connect: connection refused
  Warning  BackOff    3m28s (x968 over 4h42m)  kubelet  Back-off restarting failed container operator in pod arango-deployment-operator-7f59876f78-7djdr_default(d1d6ec8e-b413-4ab8-84d7-8f6686cd3a8a)
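For reference, the timestamps and probe settings in the describe output above allow a rough timing check. The sketch below is a back-of-the-envelope calculation, not a diagnosis. It assumes that the pod spec does not override terminationGracePeriodSeconds (so the Kubernetes default of 30 seconds applies) and that the first probe fires right after the initial delay; in practice the first probe can fire later, so the computed figures are lower bounds.

```python
from email.utils import parsedate_to_datetime

# Values copied from the `kubectl describe pod` output above.
STARTED = "Thu, 19 Oct 2023 17:39:23 +0200"
FINISHED = "Thu, 19 Oct 2023 17:40:22 +0200"
INITIAL_DELAY_S = 5     # delay=5s
PERIOD_S = 10           # period=10s
FAILURE_THRESHOLD = 3   # #failure=3

# Assumption: terminationGracePeriodSeconds is not set in the pod spec,
# so the Kubernetes default of 30 seconds applies.
GRACE_PERIOD_S = 30


def container_lifetime_s(started: str, finished: str) -> int:
    """Seconds between the Started and Finished timestamps."""
    delta = parsedate_to_datetime(finished) - parsedate_to_datetime(started)
    return int(delta.total_seconds())


def earliest_liveness_kill_s(initial_delay: int, period: int, failures: int) -> int:
    """Earliest moment at which `failures` consecutive probes can have failed."""
    return initial_delay + period * (failures - 1)


lifetime = container_lifetime_s(STARTED, FINISHED)
kill_decision = earliest_liveness_kill_s(INITIAL_DELAY_S, PERIOD_S, FAILURE_THRESHOLD)

print(f"container lifetime: {lifetime}s")
print(f"earliest liveness restart decision: {kill_decision}s")
print(f"earliest SIGKILL if SIGTERM is ignored: {kill_decision + GRACE_PERIOD_S}s")
```

Under these assumptions, a container that never opens port 8528 would be force-killed a little under a minute after it starts, which is close to the lifetime shown above. That is consistent with a restart triggered by the failing liveness probe, although the output alone does not prove it.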

root@k8s-eu-1-master:~# kubectl logs arango-deployment-operator-7f59876f78-7djdr
2023-10-19T15:45:24Z INF nice to meet you operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature agency-poll (deployment.feature.agency-poll) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature deployment-spec-defaults-restore (deployment.feature.deployment-spec-defaults-restore) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature encryption-rotation (deployment.feature.encryption-rotation) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature enforced-resign-leadership (deployment.feature.enforced-resign-leadership) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature ephemeral-volumes (deployment.feature.ephemeral-volumes) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature failover-leadership (deployment.feature.failover-leadership) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature force-rebuild-out-synced-shards (deployment.feature.force-rebuild-out-synced-shards) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature graceful-shutdown (deployment.feature.graceful-shutdown) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature init-containers-copy-resources (deployment.feature.init-containers-copy-resources) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature jwt-rotation (deployment.feature.jwt-rotation) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature local-storage.pass-reclaim-policy (deployment.feature.local-storage.pass-reclaim-policy) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature local-volume-replacement-check (deployment.feature.local-volume-replacement-check) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature maintenance (deployment.feature.maintenance) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature metrics-exporter (deployment.feature.metrics-exporter) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature optional-graceful-shutdown (deployment.feature.optional-graceful-shutdown) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature random-pod-names (deployment.feature.random-pod-names) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature rebalancer-v2 (deployment.feature.rebalancer-v2) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature restart-policy-always (deployment.feature.restart-policy-always) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature secured-containers (deployment.feature.secured-containers) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature sensitive-information-protection (deployment.feature.sensitive-information-protection) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature short-pod-names (deployment.feature.short-pod-names) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature timezone-management (deployment.feature.timezone-management) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature tls-rotation (deployment.feature.tls-rotation) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature tls-sni (deployment.feature.tls-sni) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature upgrade-version-check (deployment.feature.upgrade-version-check) is enabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature upgrade-version-check-v2 (deployment.feature.upgrade-version-check-v2) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Operator Feature version.3-10 (deployment.feature.version.3-10) is disabled. operator-id=7djdr
2023-10-19T15:45:24Z INF Starting arangodb-operator (Community), version 1.2.34 build 05e58812 operator-id=7djdr pod-name=arango-deployment-operator-7f59876f78-7djdr pod-namespace=default
2023-10-19T15:45:54Z INF Get Operations is not allowed. Continue crd=arangojobs.apps.arangodb.com operator-id=7djdr

Why did both Arango pods suddenly change their state from Running to CrashLoopBackOff?

root@k8s-eu-1-master:~# kubectl get pods
  NAME                                          READY   STATUS             RESTARTS        AGE
  arango-deployment-operator-7f59876f78-7djdr   0/1     CrashLoopBackOff   87 (100s ago)   4h59m
  arango-storage-operator-6c7fdf5586-gjcrp      0/1     CrashLoopBackOff   83 (3m7s ago)   4h45m
  root@k8s-eu-1-master:~# 

  root@k8s-eu-1-master:~# kubectl get pods
  NAME                                          READY   STATUS             RESTARTS         AGE
  arango-deployment-operator-7f59876f78-7djdr   0/1     CrashLoopBackOff   89 (4m47s ago)   5h9m
  arango-storage-operator-6c7fdf5586-gjcrp      0/1     Running            86 (6m4s ago)    4h55m
  root@k8s-eu-1-master:~# 

Update 1:

As kindly suggested by @Sat21343, I defined resource (memory and CPU) requests and limits ( https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1 ):

  containers:
        - name: operator
          # https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "526Mi"
              cpu: "500m"

The same for arango-storage.yaml.
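As a reading aid for the quantities above, here is a minimal sketch that converts them to plain numbers. The helper names memory_to_bytes and cpu_to_millicores are hypothetical, and the parser handles only a subset of the suffixes Kubernetes accepts.

```python
from decimal import Decimal

# Subset of the suffixes Kubernetes accepts for memory quantities.
MEMORY_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,   # binary (power-of-two) suffixes
    "k": 10**3, "M": 10**6, "G": 10**9,      # decimal suffixes
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" to bytes."""
    for suffix, factor in MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: a plain number of bytes


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


for name, value in [("request", "128Mi"), ("limit", "526Mi")]:
    print(f"memory {name}: {value} = {memory_to_bytes(value)} bytes")
for name, value in [("request", "250m"), ("limit", "500m")]:
    print(f"cpu {name}: {value} = {cpu_to_millicores(value)} millicores")
```

So "250m" is a quarter of a CPU core, and "526Mi" is a valid quantity slightly above the more common 512Mi.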

But the pods still go back and forth between the Running and CrashLoopBackOff states:

root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS    RESTARTS      AGE
arango-deployment-operator-65cd58968f-xmz5w   0/1     Running   3 (7s ago)    3m8s
arango-storage-operator-58b8cb7c78-8dlb7      0/1     Running   2 (60s ago)   3m

 root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS             RESTARTS      AGE
arango-deployment-operator-65cd58968f-xmz5w   0/1     CrashLoopBackOff   5 (29s ago)   6m30s
arango-storage-operator-58b8cb7c78-8dlb7      0/1     CrashLoopBackOff   5 (22s ago)   6m22s

root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS    RESTARTS      AGE
arango-deployment-operator-65cd58968f-xmz5w   0/1     Running   9 (31s ago)   19m
arango-storage-operator-58b8cb7c78-8dlb7      0/1     Running   9 (24s ago)   18m

root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS             RESTARTS      AGE
arango-deployment-operator-65cd58968f-xmz5w   0/1     CrashLoopBackOff   9 (57s ago)   20m
arango-storage-operator-58b8cb7c78-8dlb7      0/1     CrashLoopBackOff   9 (50s ago)   20m

Update 2:

I increased the resource requests and limits to:

  containers:
    - name: operator
      # https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1
      resources:
        requests:
          memory: "1024Mi"
          cpu: "500m"
        limits:
          memory: "2048Mi"
          cpu: "1000m"

But the pods still go back and forth between the Running and CrashLoopBackOff states:

root@k8s-eu-1-master:~# kubectl get pods
NAME                                          READY   STATUS        RESTARTS          AGE
arango-deployment-operator-65cd58968f-xmz5w   0/1     Terminating   294 (6m ago)      17h
arango-storage-operator-58b8cb7c78-8dlb7      0/1     Terminating   294 (5m53s ago)   17h
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# kubectl get pods
NAME                                         READY   STATUS        RESTARTS       AGE
arango-deployment-operator-5bd68475b-cdr9z   0/1     Running       0              7s
arango-storage-operator-58b8cb7c78-8dlb7     0/1     Terminating   294 (6m ago)   17h
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# kubectl get pods
NAME                                         READY   STATUS        RESTARTS         AGE
arango-deployment-operator-5bd68475b-cdr9z   0/1     Running       0                9s
arango-storage-operator-58b8cb7c78-8dlb7     0/1     Terminating   294 (6m2s ago)   17h
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# kubectl get pods
NAME                                         READY   STATUS    RESTARTS      AGE
arango-deployment-operator-5bd68475b-cdr9z   0/1     Running   5 (58s ago)   5m59s
arango-storage-operator-5bd4546bb8-g4zr5     0/1     Running   5 (45s ago)   5m45s
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# kubectl get pods
NAME                                         READY   STATUS             RESTARTS      AGE
arango-deployment-operator-5bd68475b-cdr9z   0/1     CrashLoopBackOff   5 (0s ago)    6m1s
arango-storage-operator-5bd4546bb8-g4zr5     0/1     Running            5 (47s ago)   5m47s
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# 
root@k8s-eu-1-master:~# kubectl get pods
NAME                                         READY   STATUS             RESTARTS      AGE
arango-deployment-operator-5bd68475b-cdr9z   0/1     CrashLoopBackOff   5 (6s ago)    6m7s
arango-storage-operator-5bd4546bb8-g4zr5     0/1     Running            5 (53s ago)   5m53s

This is the arango-deployment.yaml file : https://drive.google.com/file/d/1VfCjQih5aJUEA4HD9ddsQDrZbLmquWIQ/view?usp=share_link .

And this is the arango-storage.yaml file : https://drive.google.com/file/d/1hqHU_H2Wr5VFrJLwM9GDUHF17b7_CYIG/view?usp=sharing

I had to put the output of kubectl describe pod and kubectl describe pod in a text file on Google Drive, because Stack Overflow did not accept such a long text: https://drive.google.com/file/d/1kZsYeKxOa5aSppV3IdS6c7-e8dnoLiiB/view?usp=share_link

Both pods are on the same node, k8s-eu-1-worker-1, which apparently has no memory issue: https://drive.google.com/file/d/1cjBqezlnJ9vEEnqDlM4NVh8IgcfV2v8T/view?usp=sharing

Update 3:

Thanks to the suggestion of @Sat21343, I had a look at the syslog of the node just after the pod on this node went from Running to CrashLoopBackOff. These are the last lines of the syslog:

Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:10.594513     599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:10.594988     599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:21.594619     599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:21.595036     599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
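The "back-off 2m40s" in these kubelet lines fits the documented CrashLoopBackOff behaviour: after each crash the kubelet waits before restarting the container, starting at 10 seconds, doubling each time, and capping at five minutes. The sketch below reproduces that sequence. The function names are hypothetical, and it ignores the reset that happens once a container has run for a while without problems.

```python
def crashloop_backoff_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> list[int]:
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base_s * 2**i, cap_s) for i in range(restarts)]


def format_go_duration(seconds: int) -> str:
    """Format seconds the way Go prints durations in kubelet logs, e.g. 160 -> '2m40s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s" if minutes else f"{secs}s"


delays = crashloop_backoff_delays(7)
print([format_go_duration(d) for d in delays])
```

A delay of 2m40s (160 seconds) is the fifth step of this sequence. While the kubelet is waiting, the pod is reported as CrashLoopBackOff; during each new attempt it is reported as Running. That alone accounts for the back-and-forth between the two states as long as the container keeps failing.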

Last lines of the node's syslog: https://drive.google.com/file/d/1Ov_vrjsRWrLl2er_QB3yDqkZ7yN19hc-/view?usp=sharing . About the very last ones: I removed all ArangoDB deployments just to clean everything up.

What am I doing wrong? How can I make the pods stay in the "Running" state?


There are 2 answers

14
Sat21343 On

From the pod description, I can see that the container was terminated with exit code 137, which suggests that the container does not have enough memory configured to start and keep running.

Exit code 137 means the process was terminated externally with SIGKILL (128 + 9), typically because of its memory consumption: the operating system's out-of-memory (OOM) killer intervenes to stop the process before it destabilizes the host. A container killed this way is reported by Kubernetes with the reason OOMKilled and exit code 137.
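As a quick check of what the number itself encodes, here is a minimal sketch that decodes an exit code on a POSIX system. The function name describe_exit_code is hypothetical. It relies on the convention that an exit code above 128 means the process was ended by signal number (code - 128).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The code only says that SIGKILL was delivered, not who sent it. The OOM killer is one sender; the kubelet is another, for example when a container fails its liveness probe and does not stop within the grace period. In the describe output from the question, the last state shows Reason: Error rather than OOMKilled, so it is worth checking both possibilities.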

To resolve this issue, I recommend configuring resource requests and limits for the containers:

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1

0
Raphael10 On

After analyzing many logs, I opened an issue in the kube-arangodb GitHub repo: https://github.com/arangodb/kube-arangodb/issues/1456

But, as you can see here: https://github.com/arangodb/kube-arangodb/issues/1456#issuecomment-1779310532 , the ArangoDB maintainers think this is not a problem with the ArangoDB Kubernetes operator, and they closed my issue in the GitHub repo.

Lesson learned: the best way to solve an issue is not to consider it an issue to solve... is that funny?