I am facing an OpenEBS issue in my K8s infrastructure, which is deployed on AWS EKS with 3 nodes. I am deploying a StatefulSet of RabbitMQ with one replica. I want the RabbitMQ pod data to persist when the node goes down and the pod restarts on another node, so I deployed OpenEBS in my cluster. To test this, I terminated the node on which the pod was running, and the pod tried to restart on another node. But it did not start there and remained in the ContainerCreating state, showing the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m28s default-scheduler Successfully assigned rabbitmq/rabbitmq-0 to ip-10-0-1-132.ap-south-1.compute.internal
Warning FailedAttachVolume 2m28s attachdetach-controller Multi-Attach error for volume "pvc-b62d32f1-de60-499a-94f8-3c4d1625353d" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 2m26s kubelet MountVolume.SetUp failed for volume "rabbitmq-token-m99tw" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 25s kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[configuration data rabbitmq-token-m99tw]: timed out waiting for the condition
Then after some time (around 5-10 minutes), the rabbitmq pod was able to start, but I observed that one cstor-disk-pool pod was failing with the following error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m7s (x3 over 7m9s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match node selector.
Warning FailedScheduling 44s (x8 over 6m14s) default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
I described that cstor-disk-pool pod, and its Node-Selectors key still has the value of the old node (the one that was terminated). Can someone please help me with this issue? Also, we need a way to reduce the time it takes for the rabbitmq pod to restart and become ready, as we can't afford 5-10 minutes of downtime of the RabbitMQ service for our application.
For the volumes to sustain a single node failure, you will need to have created:
- a cStor pool on each of the 3 nodes, and
- the volume with 3 replicas, i.e. a replica count of 3 in the StorageClass (see the sketch below).
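A minimal sketch of what that setup can look like, assuming the non-CSI cStor provisioner; the pool claim name, pool type, and the use of auto mode are placeholders to adapt to your cluster and disks:

```yaml
# StoragePoolClaim: ask OpenEBS to create a cStor pool on up to 3 nodes
# (auto mode; assumes each node has an unclaimed block device available).
apiVersion: openebs.io/v1alpha1
kind: StoragePoolClaim
metadata:
  name: cstor-disk-pool
spec:
  name: cstor-disk-pool
  type: disk
  maxPools: 3
  poolSpec:
    poolType: striped
---
# StorageClass: provision cStor volumes with 3 replicas spread across the pools.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-cstor-replicated
  annotations:
    openebs.io/cas-type: cstor
    cas.openebs.io/config: |
      - name: StoragePoolClaim
        value: "cstor-disk-pool"
      - name: ReplicaCount
        value: "3"
provisioner: openebs.io/provisioner-iscsi
```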
When one of the nodes is gone, the volume will be able to serve data from the remaining two replicas.
(To make the pod move from the failed node to a new node faster, you will have to configure the tolerations appropriately; the default is 5 minutes. See the sketch below.)
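The 5-minute delay comes from the NoExecute tolerations that Kubernetes adds to every pod by default (tolerationSeconds: 300). A sketch of overriding them in the RabbitMQ StatefulSet pod template; the 30-second value is only an example to tune for your environment:

```yaml
# In the StatefulSet, under spec.template.spec:
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30   # evict after 30s instead of the default 300s
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
```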
The cStor pools are tied to the node on which they are created. This is done to allow re-use of the data from the pool when the node comes back. There are a few solutions, depending on how your nodes and disks are configured, that can help you automate running the cStor pools or moving them from a failed node to a new node. Could you join the Kubernetes Slack #openebs channel or create an issue on the OpenEBS GitHub to take further help?
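For reference, this pinning is what the FailedScheduling events above reflect: the cstor-disk-pool Deployment's pod spec carries a nodeSelector for the hostname of the node the pool was created on, roughly like the fragment below (the node name is a placeholder for your terminated node), so once that node is gone no other node can match it:

```yaml
# Illustrative fragment of the cstor-disk-pool Deployment's pod spec;
# the hostname value stands in for the (now terminated) node.
nodeSelector:
  kubernetes.io/hostname: <old-node-name>
```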