Why didn't the k8s rolling update stop when the number of CrashLoopBackOff pods exceeded maxUnavailable?

I’m trying to use the rolling update of a k8s DaemonSet so that pods are updated automatically when the DaemonSet's spec.template field changes. I intentionally set an invalid image for the pods so that they could not start correctly. I expected the rolling update to stop once the number of unavailable pods exceeded the number defined in maxUnavailable. Unfortunately, that doesn't happen: the pods keep being updated until all of them are in CrashLoopBackOff.

I ran my test in a 3-node environment: kubectl get node -A

NAME                         STATUS   ROLES    AGE   VERSION
wdc-rdops-vm05-dhcp-74-190   Ready    <none>   65d   v1.18.0
wdc-rdops-vm05-dhcp-86-61    Ready    master   65d   v1.18.0
wdc-rdops-vm05-dhcp-93-214   Ready    <none>   65d   v1.18.0

I found a similar thread, "How to automatically stop rolling update when CrashLoopBackOff?", but my case concerns a DaemonSet rather than a Deployment.

As suggested in that thread, I've added

spec:
  minReadySeconds: 120 

so that a pod's available or unavailable status reflects whether its containers are actually running well.

However, all 3 pods eventually crashed. Shortly after the update started:

nsx-system   nsx-node-agent-9cl2v       0/3     CrashLoopBackOff      3          23s
nsx-system   nsx-node-agent-c95wb       3/3     Running               3          11m
nsx-system   nsx-node-agent-p58vs       3/3     Running               3          11m

The first updated pod was never healthy for as long as 120 seconds, so it should have been counted as unavailable. However, the update did not stop as expected; it kept going until all pods had been replaced, and all of them crashed:

nsx-system     nsx-node-agent-9cl2v             0/3     CrashLoopBackOff        45         15m 
nsx-system     nsx-node-agent-6mlmq             0/3     CrashLoopBackOff        48         2m46s
nsx-system     nsx-node-agent-9fzcc             0/3     CrashLoopBackOff        57         2m59s

The complete YAML of the DaemonSet: kubectl get ds -n nsx-system nsx-node-agent -o yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  creationTimestamp: "2021-02-21T11:28:03Z"
  generation: 101
  labels:
    component: nsx-node-agent
    tier: nsx-networking
    version: v1
  managedFields:
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:deprecated.daemonset.template.generation: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:component: {}
          f:tier: {}
          f:version: {}
      f:spec:
        f:revisionHistoryLimit: {}
        f:selector:
          f:matchLabels:
            .: {}
            f:component: {}
            f:tier: {}
            f:version: {}
        f:template:
          f:metadata:
            f:annotations:
              .: {}
              f:container.apparmor.security.beta.kubernetes.io/nsx-node-agent: {}
            f:labels:
              .: {}
              f:component: {}
              f:tier: {}
              f:version: {}
          f:spec:
            f:containers:
              k:{"name":"nsx-kube-proxy"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"CONTAINER_NAME"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"POD_NAME"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                f:imagePullPolicy: {}
                f:livenessProbe:
                  .: {}
                  f:exec:
                    .: {}
                    f:command: {}
                  f:failureThreshold: {}
                  f:initialDelaySeconds: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:name: {}
                f:resources: {}
                f:securityContext:
                  .: {}
                  f:capabilities:
                    .: {}
                    f:add: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                  k:{"mountPath":"/var/run/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
              k:{"name":"nsx-node-agent"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"CONTAINER_NAME"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"POD_NAME"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                f:imagePullPolicy: {}
                f:livenessProbe:
                  .: {}
                  f:exec: {}
                  f:failureThreshold: {}
                  f:initialDelaySeconds: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:name: {}
                f:resources: {}
                f:securityContext:
                  .: {}
                  f:capabilities:
                    .: {}
                    f:add: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/host/etc/os-release"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/host/proc"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/host/var/run/netns"}:
                    .: {}
                    f:mountPath: {}
                    f:mountPropagation: {}
                    f:name: {}
                  k:{"mountPath":"/var/lib/kubelet/device-plugins/"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                  k:{"mountPath":"/var/run/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/run/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
              k:{"name":"nsx-ovs"}:
                .: {}
                f:command: {}
                f:imagePullPolicy: {}
                f:livenessProbe:
                  .: {}
                  f:exec:
                    .: {}
                    f:command: {}
                  f:failureThreshold: {}
                  f:initialDelaySeconds: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:name: {}
                f:resources: {}
                f:securityContext:
                  .: {}
                  f:capabilities:
                    .: {}
                    f:add: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/etc/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:subPath: {}
                  k:{"mountPath":"/host/etc/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/host/etc/os-release"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/lib/modules"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/sys"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    .: {}
                    f:mountPath: {}
                  k:{"mountPath":"/var/log/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:subPath: {}
                  k:{"mountPath":"/var/run/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
            f:dnsPolicy: {}
            f:hostNetwork: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:serviceAccount: {}
            f:serviceAccountName: {}
            f:terminationGracePeriodSeconds: {}
            f:tolerations: {}
            f:volumes:
              .: {}
              k:{"name":"device-plugins"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-modules"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-original-ovs-db"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-os-release"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-sys"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-var-log-ujo"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"netns"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"openvswitch"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"proc"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"projected-volume"}:
                .: {}
                f:name: {}
                f:projected:
                  .: {}
                  f:defaultMode: {}
                  f:sources: {}
              k:{"name":"var-run-ujo"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
        f:updateStrategy:
          f:rollingUpdate:
            .: {}
            f:maxUnavailable: {}
          f:type: {}
    manager: kubectl
    operation: Update
    time: "2021-04-19T08:07:54Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:minReadySeconds: {}
        f:template:
          f:spec:
            f:containers:
              k:{"name":"nsx-kube-proxy"}:
                f:image: {}
                f:volumeMounts:
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    f:name: {}
              k:{"name":"nsx-node-agent"}:
                f:image: {}
                f:livenessProbe:
                  f:exec:
                    f:command: {}
                f:volumeMounts:
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    f:name: {}
              k:{"name":"nsx-ovs"}:
                f:image: {}
                f:volumeMounts:
                  k:{"mountPath":"/var/log/nsx-ujo"}:
                    f:name: {}
      f:status:
        f:desiredNumberScheduled: {}
    manager: nsx-ncp-operator
    operation: Update
    time: "2021-04-27T10:01:23Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:currentNumberScheduled: {}
        f:numberReady: {}
        f:numberUnavailable: {}
        f:observedGeneration: {}
        f:updatedNumberScheduled: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-04-27T10:15:28Z"
  name: nsx-node-agent
  namespace: nsx-system
  resourceVersion: "14594084"
  selfLink: /apis/apps/v1/namespaces/nsx-system/daemonsets/nsx-node-agent
  uid: e3dd0951-1b31-4095-8c27-56ec9780d94e
spec:
  minReadySeconds: 120
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: nsx-node-agent
      tier: nsx-networking
      version: v1
  template:
    metadata:
      annotations:
        container.apparmor.security.beta.kubernetes.io/nsx-node-agent: localhost/node-agent-apparmor
      creationTimestamp: null
      labels:
        component: nsx-node-agent
        tier: nsx-networking
        version: v1
    spec:
      containers:
      - command:
        - start_node_agent
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: CONTAINER_NAME
          value: nsx-node-agent
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - check_pod_liveness nsx-node-agent 5
          failureThreshold: 5
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: nsx-node-agent
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_ADMIN
            - SYS_PTRACE
            - DAC_READ_SEARCH
            - NET_RAW
            - AUDIT_WRITE
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/nsx-ujo
          name: projected-volume
          readOnly: true
        - mountPath: /var/run/openvswitch
          name: openvswitch
        - mountPath: /var/run/nsx-ujo
          name: var-run-ujo
        - mountPath: /host/var/run/netns
          mountPropagation: HostToContainer
          name: netns
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /var/lib/kubelet/device-plugins/
          name: device-plugins
          readOnly: true
        - mountPath: /host/etc/os-release
          name: host-os-release
          readOnly: true
        - mountPath: /var/log/nsx-ujo
          name: host-var-log-ujo
      - command:
        - start_kube_proxy
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: CONTAINER_NAME
          value: nsx-kube-proxy
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - check_pod_liveness nsx-kube-proxy 5
          failureThreshold: 5
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: nsx-kube-proxy
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_ADMIN
            - SYS_PTRACE
            - DAC_READ_SEARCH
            - NET_RAW
            - AUDIT_WRITE
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/nsx-ujo
          name: projected-volume
          readOnly: true
        - mountPath: /var/run/openvswitch
          name: openvswitch
        - mountPath: /var/log/nsx-ujo
          name: host-var-log-ujo
      - command:
        - start_ovs
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - check_pod_liveness nsx-ovs 10
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: nsx-ovs
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_ADMIN
            - SYS_NICE
            - SYS_MODULE
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/nsx-ujo
          name: projected-volume
          readOnly: true
        - mountPath: /etc/openvswitch
          name: var-run-ujo
          subPath: openvswitch-db
        - mountPath: /var/run/openvswitch
          name: openvswitch
        - mountPath: /sys
          name: host-sys
          readOnly: true
        - mountPath: /host/etc/openvswitch
          name: host-original-ovs-db
        - mountPath: /lib/modules
          name: host-modules
          readOnly: true
        - mountPath: /host/etc/os-release
          name: host-os-release
          readOnly: true
        - mountPath: /var/log/openvswitch
          name: host-var-log-ujo
          subPath: openvswitch
        - mountPath: /var/log/nsx-ujo
          name: host-var-log-ujo
      dnsPolicy: ClusterFirst
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nsx-node-agent-svc-account
      serviceAccountName: nsx-node-agent-svc-account
      terminationGracePeriodSeconds: 60
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
      - effect: NoSchedule
        key: node.kubernetes.io/unreachable
      volumes:
      - name: projected-volume
        projected:
          defaultMode: 420
          sources:
          - configMap:
              items:
              - key: ncp.ini
                path: ncp.ini
              name: nsx-node-agent-config
          - configMap:
              items:
              - key: version
                path: VERSION
              name: nsx-ncp-version-config
      - hostPath:
          path: /var/run/openvswitch
          type: ""
        name: openvswitch
      - hostPath:
          path: /var/run/nsx-ujo
          type: ""
        name: var-run-ujo
      - hostPath:
          path: /var/run/netns
          type: ""
        name: netns
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /var/lib/kubelet/device-plugins/
          type: ""
        name: device-plugins
      - hostPath:
          path: /var/log/nsx-ujo
          type: DirectoryOrCreate
        name: host-var-log-ujo
      - hostPath:
          path: /sys
          type: ""
        name: host-sys
      - hostPath:
          path: /lib/modules
          type: ""
        name: host-modules
      - hostPath:
          path: /etc/openvswitch
          type: ""
        name: host-original-ovs-db
      - hostPath:
          path: /etc/os-release
          type: ""
        name: host-os-release
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 3
  observedGeneration: 101
  updatedNumberScheduled: 3

The DaemonSet status output is as follows: kc get ds -n nsx-system -w

NAME                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nsx-node-agent      3         3         0       3            0           <none>          64d

I don't understand why k8s didn't stop the update when the number of unavailable pods exceeded maxUnavailable: 1.

In addition, the pods' age is far greater than minReadySeconds.

It seems that the k8s rolling update strategy doesn't follow the defined spec. It shouldn't allow this situation to happen during a rolling update.

There is 1 answer

Ivan

I don't see readiness probes defined in your manifests. Without readiness probes, Kubernetes considers a pod "ready" as soon as its process is running, and proceeds to terminate other pods during a RollingUpdate.

A failing readiness probe on one pod with maxUnavailable set to 1 should stop the update, but without such a probe nothing informs the cluster that the pod is not actually ready to accept traffic.
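
As a rough illustration of the suggestion above, a readinessProbe could be added next to the existing livenessProbe on each container. The fragment below is only a sketch for the nsx-node-agent container: reusing the check_pod_liveness command as the readiness check, and the timing values, are assumptions rather than anything taken from the NSX documentation, so substitute a command that reflects real readiness for your agent.

```yaml
# Sketch only: fragment of the DaemonSet spec, not a complete manifest.
spec:
  minReadySeconds: 120          # unchanged from the question
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1         # unchanged from the question
  template:
    spec:
      containers:
      - name: nsx-node-agent
        # Assumption: the liveness command is reused as a readiness check.
        # Replace it with whatever really indicates the agent is ready.
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - check_pod_liveness nsx-node-agent 5
          initialDelaySeconds: 10   # illustrative value
          periodSeconds: 10         # illustrative value
          failureThreshold: 3       # illustrative value
          timeoutSeconds: 5
```

With a probe like this in place, a pod whose containers never pass the check stays not ready, and the READY and AVAILABLE columns of kubectl get ds show whether the controller is counting it as unavailable.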