Azure Files error causing containers to fail during execution: "[Errno 11] Resource temporarily unavailable"


I have an intermittent problem with AKS pods failing during execution. The pods have an Azure Files share specified in their Kubernetes manifests, and are able to mount this share upon starting. The mount is specified using the CSI driver provided by Microsoft.

These pods run ML training jobs, so their workload is fairly read-heavy. Some of them intermittently fail during training while trying to read a file; the error message is "[Errno 11] Resource temporarily unavailable". Roughly half of the runs succeed.

To me this suggests intermittent connectivity issues, but I would not expect any because:

  1. The mount is specified using the proper CSI driver
  2. The file share is always successfully mounted on container start, and I can exec into the container, browse the share and read the files
  3. The file share is created in the same region as the Kubernetes cluster

I am using Kubernetes 1.26.6 on all nodes.
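
To tell a transient failure from a persistent one, the failing read can be wrapped in a retry loop that logs each occurrence. This is a minimal sketch, assuming the training code is Python (the "[Errno 11]" format suggests so) and that the error surfaces as an OSError with errno EAGAIN. The name read_with_retry and its parameters are hypothetical, not part of any library, and the wrapper is a diagnostic aid rather than a fix for the underlying cause.

```python
import errno
import time


def read_with_retry(path, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Read a file fully, retrying when the OS reports EAGAIN (errno 11 on Linux)."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            # Re-raise anything other than EAGAIN, and give up after the last attempt.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"EAGAIN reading {path} (attempt {attempt}/{attempts}); retrying in {delay:.1f}s")
            sleep(delay)
```

If reads succeed on the second or third attempt, the failures are likely short-lived; if every attempt fails, the mount has probably entered a bad state for longer than the backoff window.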

Here is the YAML for the StorageClass:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2023-11-09T17:26:35Z"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: azurefile-premium
  resourceVersion: "4868935"
  uid: 9f564575-d36c-40fc-acc0-7ce94de80da9
mountOptions:
- async
- mfsymlinks
- actimeo=30
- nosharesock
parameters:
  skuName: Premium_LRS
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
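
The mount options declared above are not necessarily the ones in effect on the node, so it can help to check what the kernel reports from inside a running pod. This is a minimal sketch, assuming a Linux container and that the share is mounted over SMB and therefore appears with filesystem type cifs. The function name cifs_mounts is hypothetical.

```python
def cifs_mounts(mounts_text):
    """Return {mount_point: [options]} for every CIFS entry in /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cifs":
            result[fields[1]] = fields[3].split(",")
    return result
```

Inside the pod, calling cifs_mounts(open("/proc/mounts").read()) returns the options per mount point, which can then be compared with the mountOptions in the StorageClass and the PV.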

The PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: file.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2023-11-09T23:44:00Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-331bb244-14c2-4510-a5b5-2535bc022ee2
  resourceVersion: "6443272"
  uid: c8ab511f-4876-413b-86e2-57e30e20244a
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 30Ti
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: machine-learning-team-data
    namespace: clearml
    resourceVersion: "6443270"
    uid: 98e13a9e-ce8f-479e-b88d-acae5d1d684b
  csi:
    driver: file.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-331bb244-14c2-4510-a5b5-2535bc022ee2
      csi.storage.k8s.io/pvc/name: machine-learning-team-data
      csi.storage.k8s.io/pvc/namespace: clearml
      secretnamespace: clearml
      skuName: Premium_LRS
      storage.kubernetes.io/csiProvisionerIdentity: 1699550739022-5062-file.csi.azure.com
    volumeHandle: mlops-nodes#f25531b5495a8486281d07a#pvc-331bb244-14c2-4510-a5b5-2535bc022ee2###clearml
  mountOptions:
  - mfsymlinks
  - actimeo=30
  - nosharesock
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-premium
  volumeMode: Filesystem
status:
  phase: Bound
