How does the Kubernetes API server start a newly scheduled pod on a node?


I'm trying to get a better view 'under the hood' of how the Kubernetes Pod scheduling and creation process works, with respect to the interaction between kubelet and kube-apiserver.

I understand that the Kubernetes scheduler chooses a node to allocate a new pod to and notifies the API server of this. However, I am unclear how the API server notifies the kubelet on the node in question to start the pod. Is there a polling process within kubelet that queries the API server for changes? Or is there an event listener / call-back type interaction?

If anyone knows the answer or could point me in the direction of some documentation that would be greatly appreciated!

There are 2 answers

Blokje5 (BEST ANSWER)

Alibaba had a really insightful blog post on the inner workings of the scheduler. From the blog:


The scheduler basically works like this:

  • The scheduler maintains a scheduled podQueue and listens to the APIServer.
  • When we create a Pod, we first write Pod metadata to etcd through the APIServer.
  • The scheduler listens to the Pod status through Informer. When a new Pod is added, the Pod is added to the podQueue.
  • The main process continuously extracts Pods from the podQueue and assigns nodes to Pods.
  • The scheduling process consists of two steps: Filter matching nodes and prioritize these nodes based on Pod configuration (for example, by metrics like resource usage and affinity) to score nodes and select the node with the highest score.
  • After a node is assigned successfully, invoke the binding pod interface of the APIServer and set pod.Spec.NodeName to the assigned node.
  • The kubelet on the node also listens to the ApiServer. If it finds that a new Pod is scheduled to that node, the local dockerDaemon is invoked to run the container.
  • If the scheduler fails to schedule a Pod and priority and preemption are enabled, a preemption attempt is made first: lower-priority Pods on a node are deleted so that the pending Pod can be scheduled there. If preemption is not enabled or the attempt fails, the failure is logged and the Pod is added to the end of the podQueue.
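The filter-then-score loop described in the list can be sketched in a few lines. This is only a toy model, not the scheduler's actual code: the `Node` and `Pod` classes, the `fits`, `score`, `schedule_one` and `run_queue` names, and the "most free capacity left" scoring rule are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int


@dataclass
class Pod:
    name: str
    cpu_millis: int
    mem_mib: int
    node_name: Optional[str] = None  # stands in for pod.Spec.NodeName


def fits(pod: Pod, node: Node) -> bool:
    """Filter step: keep only nodes with enough free CPU and memory."""
    return node.free_cpu_millis >= pod.cpu_millis and node.free_mem_mib >= pod.mem_mib


def score(pod: Pod, node: Node) -> float:
    """Score step (toy rule): cores plus GiB that would remain free after placement."""
    cpu_left = node.free_cpu_millis - pod.cpu_millis
    mem_left = node.free_mem_mib - pod.mem_mib
    return cpu_left / 1000 + mem_left / 1024


def schedule_one(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Pick the highest-scoring feasible node and 'bind' the pod to it."""
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None  # a real scheduler would try preemption or requeue here
    best = max(feasible, key=lambda n: score(pod, n))
    best.free_cpu_millis -= pod.cpu_millis
    best.free_mem_mib -= pod.mem_mib
    pod.node_name = best.name  # the binding step sets the node name on the pod
    return best.name


def run_queue(pod_queue: Deque[Pod], nodes: List[Node]) -> List[Pod]:
    """Drain the queue once and return the pods that could not be placed."""
    unschedulable = []
    while pod_queue:
        pod = pod_queue.popleft()
        if schedule_one(pod, nodes) is None:
            unschedulable.append(pod)
    return unschedulable
```

With nodes `Node("node-a", 2000, 4096)` and `Node("node-b", 4000, 8192)`, calling `schedule_one(Pod("web", 500, 512), nodes)` picks `node-b`, because it would be left with more free capacity under this toy rule.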

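On the Kubelet polling: the API server supports a "watch" mode. A watch is a long-lived HTTP request on which the server streams change events as they happen (by default as a chunked HTTP response; WebSocket is also supported). The Kubelet opens such a watch for Pods whose spec.nodeName equals its own node name, so it is notified of any change to its Pods instead of polling for them.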
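As a rough illustration of that watch, the sketch below builds the request path for "Pods assigned to this node" and folds the resulting event stream into a local view. It assumes the standard watch wire format (one JSON object per line with `type` and `object` fields); the helper names `pod_watch_path`, `parse_watch_stream` and `apply_events` are made up for this example, and the real Kubelet does this in Go through client libraries rather than by hand.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple
from urllib.parse import urlencode


def pod_watch_path(node_name: str, resource_version: str = "") -> str:
    """Build the path of a watch request for Pods bound to one node."""
    params = {"watch": "true", "fieldSelector": f"spec.nodeName={node_name}"}
    if resource_version:
        params["resourceVersion"] = resource_version
    return "/api/v1/pods?" + urlencode(params)


def parse_watch_stream(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event type, object) pairs from newline-delimited JSON events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        yield event["type"], event["object"]


def apply_events(events: Iterable[Tuple[str, dict]], local_pods: Dict[str, dict]) -> Dict[str, dict]:
    """Keep a local map of namespace/name -> Pod in sync with the event stream."""
    for event_type, pod in events:
        meta = pod["metadata"]
        key = meta["namespace"] + "/" + meta["name"]
        if event_type in ("ADDED", "MODIFIED"):
            local_pods[key] = pod  # new or changed Pod for this node
        elif event_type == "DELETED":
            local_pods.pop(key, None)  # Pod no longer belongs on this node
    return local_pods
```

In practice the lines would come from the open HTTP response body; here any iterable of strings works, which keeps the sketch independent of a particular HTTP client.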

Max Lobur

I don't have a link to the source code, but I'm fairly sure the kubelet uses the watch parameter described in the API reference:

Query Parameters
...
watch   Watch for changes to the described resources and return them as a stream of add, update, and remove notifications. Specify resourceVersion.
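To show why the description says to specify resourceVersion, here is a minimal sketch of a client that remembers the last version it saw so that it can reopen the watch from that point after a disconnect. It assumes the usual event shape (`type` plus `object`, with `metadata.resourceVersion` on each object) and that an `ERROR` event carrying code 410 means the version is too old and a fresh list is needed; `last_resource_version` and `ResourceVersionExpired` are hypothetical names.

```python
import json
from typing import Iterable, Optional


class ResourceVersionExpired(Exception):
    """The requested resourceVersion is too old; the client must list again."""


def last_resource_version(lines: Iterable[str], start: Optional[str] = None) -> Optional[str]:
    """Consume watch events and return the newest resourceVersion seen."""
    current = start
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        obj = event["object"]
        if event["type"] == "ERROR":
            # For ERROR events the object is a Status, not the watched resource.
            if obj.get("code") == 410:
                raise ResourceVersionExpired(obj.get("message", ""))
            raise RuntimeError(obj.get("message", "watch error"))
        # ADDED, MODIFIED, DELETED and BOOKMARK events all carry a version.
        current = obj["metadata"]["resourceVersion"]
    return current
```

When the stream ends, the returned value can be passed as the `resourceVersion` query parameter of the next watch request so that no events in between are missed.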

The API server's watch is built on the watch feature of etcd (the database behind the API server): https://etcd.io/docs/v3.2.17/learning/api/. See "Watch streams" there:

Watches are long running requests and use gRPC streams to stream event data.

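So it's a kind of long polling: the client keeps one request open and the server pushes events over it as they occur. Note that the gRPC stream is between the API server and etcd; the kubelet's own watch on the API server is an HTTP request that stays open.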