K8s FailedCreatePodSandBox PF device not found


I am new to K8s, so this may be a small mistake or a big one on my part, but I am not able to resolve it on my own. Here are my setup details and the problem.

I am using a minikube cluster with 2 nodes on the same machine.

 minikube profile list
|----------|-----------|---------|--------------|------|---------|---------|-------|--------|
| Profile  | VM Driver | Runtime |      IP      | Port | Version | Status  | Nodes | Active |
|----------|-----------|---------|--------------|------|---------|---------|-------|--------|
| minikube | docker    | docker  | 192.168.76.2 | 8443 | v1.27.4 | Running |     2 | *      |
|----------|-----------|---------|--------------|------|---------|---------|-------|--------|

I created 2 VFs from one PF.

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:77:f9:80 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 1e:4c:7c:6b:7d:0f brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off, query_rss off
    vf 1     link/ether 9e:52:6e:e0:e9:07 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off, query_rss off

I want to create a pod with multiple interfaces, and I want to attach one VF to one pod.

For that, I installed SRIOV-CNI and placed the sriov binary in the /opt/cni/bin folder, as suggested in the documentation.

After that, I downloaded sriov-network-device-plugin, then created and applied the ConfigMap. Below is an extract from my config file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "intel_sriov_netdevice",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["15a8"],
                    "drivers": ["ixgbevf"]
                }
            },

k -n kube-system get pod -l app=sriovdp -o wide
NAME                                   READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
kube-sriov-device-plugin-amd64-64bf7   1/1     Running   0          5h41m   192.168.76.2   minikube       <none>           <none>
kube-sriov-device-plugin-amd64-7knxh   1/1     Running   0          5h41m   192.168.76.3   minikube-m02   <none>           <none>

My nodes are able to see my VFs as well:

 kubectl get node  minikube -o jsonpath='{.status.allocatable}' |jq -r '."intel.com/intel_sriov_netdevice"'
2

I applied the Multus DaemonSet as well:

k get pod -l app=multus -A -o wide
NAMESPACE     NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
kube-system   kube-multus-ds-47zg5   1/1     Running   0          5h38m   192.168.76.2   minikube       <none>           <none>
kube-system   kube-multus-ds-rjrzn   1/1     Running   0          5h38m   192.168.76.3   minikube-m02   <none>           <none>

After all this, when I try to launch a pod, it gets stuck and does not come up.

Events from the pod are:

Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               15s                default-scheduler  Successfully assigned default/testpod1 to minikube-m02
  Normal   AddedInterface          14s                multus             Add eth0 [10.244.1.32/24] from kindnet
  Warning  FailedCreatePodSandBox  14s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0f5d4259dfaa0087fba50ee5c656050fc6af6c01430bdd820e0642fba9d384de" network for pod "testpod1": networkPlugin cni failed to set up pod "testpod1_default" network: plugin type="multus" name="multus-cni-network" failed (add): [default/testpod1/:sriov-network]: error adding container to network "sriov-network": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "PF network device not found"
  Normal   AddedInterface          13s                multus             Add eth0 [10.244.1.33/24] from kindnet
  Warning  FailedCreatePodSandBox  12s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "39c2cc9ceff18eb34fbd4f7b0746fd0807a3999fdcef6f8dd8487b43f812a31a" network for pod "testpod1": networkPlugin cni failed to set up pod "testpod1_default" network: plugin type="multus" name="multus-cni-network" failed (add): [default/testpod1/:sriov-network]: error adding container to network "sriov-network": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "PF network device not found"
  Normal   AddedInterface          12s                multus             Add eth0 [10.244.1.34/24] from kindnet
  Warning  FailedCreatePodSandBox  11s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "20fb94f524db27d3b1bcb0e743c925ddb6af038c9220de28cd84ec92215b0ab3" network for pod "testpod1": networkPlugin cni failed to set up pod "testpod1_default" network: plugin type="multus" name="multus-cni-network" failed (add): [default/testpod1/:sriov-network]: error adding container to network "sriov-network": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "PF network device not found"
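
One check that may help narrow down the "PF network device not found" message: with the docker driver, each minikube node is itself a container with its own network namespace, so the host's PF interface may not be visible from inside the node even though the VFs are advertised as allocatable. The sketch below assumes (the post does not confirm this) that the PF name is looked up through the VF's `physfn/net` directory in sysfs. `pf_of_vf`, `SYSFS_ROOT`, and the PCI address in the usage comment are hypothetical names used only for illustration.

```shell
# pf_of_vf VF_PCI_ADDR
# Prints the PF netdev name(s) that sysfs exposes for the given VF.
# Returns non-zero when no PF netdev is visible from the current namespace.
# SYSFS_ROOT defaults to /sys; it is overridable only so the function
# can be tried against a fake directory tree.
pf_of_vf() {
    vf_addr="$1"
    net_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$vf_addr/physfn/net"
    found=1
    for entry in "$net_dir"/*; do
        # Skip the unexpanded glob when the directory is missing or empty.
        [ -e "$entry" ] || continue
        basename "$entry"
        found=0
    done
    if [ "$found" -ne 0 ]; then
        echo "no PF netdev visible for $vf_addr" >&2
    fi
    return "$found"
}

# Example, run inside the node (for instance after `minikube ssh -n minikube-m02`):
#   pf_of_vf 0000:03:10.0    # placeholder address; use a real VF address from the node
```

If the function prints a name on the host but nothing inside the node, the PF is probably not reachable from the node's network namespace, which would be consistent with the error above.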

1 Answer

Kiran Kotturi

As per the article written by Vinayak Pandey, several different scenarios can cause the FailedCreatePodSandBox error when you attempt to create a pod. In general, check that CNI is working on the node and that all the CNI configuration files are correct; then also verify that the system resource limits are properly set.

Scenario 1: CNI not working on the node

The Kubernetes Container Network Interface (CNI) configures networking between pods. If CNI isn’t running properly on a node, pods scheduled there can’t start and stay stuck in the ContainerCreating state.

The article reproduces this scenario by preventing the CNI pod from running on one node. As your environment has 2 nodes, check whether the same condition applies to the SRIOV-CNI on the node where the pod is scheduled.

Debugging and resolution

The error message indicates that CNI on the node where the pod is scheduled is not functioning properly, so the first step is to check whether the CNI pod is running on that node. If it is, one possible root cause is eliminated. In the article's example, the pod runs fine once the nodeSelector is removed from the DaemonSet definition and the CNI pod is running on the node.
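
As a minimal sketch of that first step, the function below lists the kube-system pods scheduled on one node, so you can see whether the CNI-related pods (kindnet, Multus, and the SR-IOV device plugin in the question's setup) are present and Running there. The function name `cni_pods_on_node` is hypothetical; the `kubectl` flags are standard ones.

```shell
# cni_pods_on_node NODE
# Lists the pods in kube-system that are scheduled on NODE.
# Filtering by spec.nodeName shows only what runs on the suspect node.
cni_pods_on_node() {
    kubectl get pods -n kube-system -o wide --field-selector "spec.nodeName=$1"
}

# Example: cni_pods_on_node minikube-m02
```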

Scenario 2: Missing or incorrect CNI configuration files

Even if the CNI pod is running, pod creation can still fail if the CNI configuration files contain errors. The article simulates this by changing the CNI configuration files, which are stored under the /etc/cni/net.d directory.

Debugging and resolution

In this scenario, verify the CNI configuration files. Compare the files on the other nodes of the cluster with those on the problematic node. If they differ, copy the configuration files from a working node to the problematic node and then try recreating the pod.
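
The comparison can be sketched roughly as follows, assuming you have already copied /etc/cni/net.d from a working node and from the problematic node into two local directories. How the copies are fetched (for example through `minikube ssh -n <node>`) is left out, and `compare_cni_conf` is a hypothetical helper name.

```shell
# compare_cni_conf GOOD_DIR SUSPECT_DIR
# Recursively diffs two local copies of /etc/cni/net.d.
# Prints the differences (if any) followed by a one-line verdict;
# returns non-zero when the directories differ or cannot be read.
compare_cni_conf() {
    good="$1"
    suspect="$2"
    if diff -r "$good" "$suspect"; then
        echo "CNI config matches"
    else
        echo "CNI config differs"
        return 1
    fi
}

# Example: compare_cni_conf ./minikube/net.d ./minikube-m02/net.d
```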