I have an intermittent problem with AKS pods failing during execution. The pods have an Azure Files share specified in their Kubernetes manifests, and are able to mount this share upon starting. The mount is specified using the CSI driver provided by Microsoft.
These pods are used for ML training jobs, so their workload is fairly read-heavy. Some of these pods intermittently fail during training while trying to read a file. The error message displayed is: [Errno 11] Resource temporarily unavailable. It's about a 50/50 success rate.
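For context, errno 11 on Linux is EAGAIN. The sketch below is not my actual training code. It only illustrates how a read could be wrapped to log and retry on that specific error while gathering diagnostics. The name `read_with_retry` and its `attempts`, `delay`, and `opener` parameters are hypothetical.

```python
import errno
import time


def read_with_retry(path, attempts=5, delay=0.5, opener=open):
    """Read a whole file, retrying only when the OS reports EAGAIN.

    `opener` is injectable so the retry logic can be exercised without
    a real network share (an assumption made for illustration).
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as f:
                return f.read()
        except OSError as exc:
            # Re-raise anything that is not EAGAIN, and give up after the
            # final attempt so the original error is still visible.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            print(f"EAGAIN reading {path} (attempt {attempt}/{attempts}), retrying")
            time.sleep(delay * attempt)  # simple linear backoff
```

A wrapper like this would only mask the symptom. Its main use here would be to log how often and on which files the error occurs.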
To me this indicates intermittent connectivity issues, but there should be none as:
- The mount is specified using the proper CSI driver
- The file share is always successfully mounted on container start and I can exec into the container and browse it & read the files
- The file share is created in the same region as the Kubernetes cluster
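One further check that could be run from inside the container is to confirm which mount options are actually in effect. This is a minimal sketch, assuming the share is mounted over SMB and so appears with filesystem type cifs in /proc/mounts. The helper name `cifs_mounts` is hypothetical.

```python
def cifs_mounts(mounts_text):
    """Return {mount_point: [options]} for each cifs entry in /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cifs":
            result[fields[1]] = fields[3].split(",")
    return result
```

Inside the pod it could be called as `cifs_mounts(open("/proc/mounts").read())` and the result compared against the mountOptions listed in the manifests.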
I am using Kubernetes 1.26.6 on all nodes.
Here is the YAML for the StorageClass:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2023-11-09T17:26:35Z"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: azurefile-premium
  resourceVersion: "4868935"
  uid: 9f564575-d36c-40fc-acc0-7ce94de80da9
mountOptions:
- async
- mfsymlinks
- actimeo=30
- nosharesock
parameters:
  skuName: Premium_LRS
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
The PV:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: file.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2023-11-09T23:44:00Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-331bb244-14c2-4510-a5b5-2535bc022ee2
  resourceVersion: "6443272"
  uid: c8ab511f-4876-413b-86e2-57e30e20244a
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 30Ti
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: machine-learning-team-data
    namespace: clearml
    resourceVersion: "6443270"
    uid: 98e13a9e-ce8f-479e-b88d-acae5d1d684b
  csi:
    driver: file.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-331bb244-14c2-4510-a5b5-2535bc022ee2
      csi.storage.k8s.io/pvc/name: machine-learning-team-data
      csi.storage.k8s.io/pvc/namespace: clearml
      secretnamespace: clearml
      skuName: Premium_LRS
      storage.kubernetes.io/csiProvisionerIdentity: 1699550739022-5062-file.csi.azure.com
    volumeHandle: mlops-nodes#f25531b5495a8486281d07a#pvc-331bb244-14c2-4510-a5b5-2535bc022ee2###clearml
  mountOptions:
  - mfsymlinks
  - actimeo=30
  - nosharesock
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-premium
  volumeMode: Filesystem
status:
  phase: Bound