Custom Scheduler on AKS - Not Bin packing as expected


I followed this article (https://cast.ai/blog/custom-kube-scheduler-why-and-how-to-set-it-up-in-kubernetes/) to deploy a custom scheduler on AKS, and I used OPA Gatekeeper to automatically switch all pods to the custom scheduler. The idea behind deploying the custom scheduler is to see whether we can achieve efficient bin packing and then use the descheduler to remove underutilized nodes, keeping the overall cost of our AKS clusters to a minimum.

The scheduler deployment and the OPA Gatekeeper piece both seem to work, and the pods are scheduled by the custom scheduler, but a quick test suggests that the custom scheduler is not behaving as I expect. Below is my test case.

  1. I have an AKS cluster running 3 nodes which are almost empty.
  2. I have a deployment named clustercheck that requests just 10m CPU.
  3. I edited the deployment to run 10 replicas and noticed that the replicas were spread across all the nodes, whereas I expected them all to be scheduled on one node because I am using the MostAllocated strategy in the scheduler.

Below is the scheduler configuration:

       - schedulerName: agys-scheduler
         pluginConfig:
           - args:
               apiVersion: kubescheduler.config.k8s.io/v1beta2
               kind: NodeResourcesFitArgs
               scoringStrategy:
                   resources:
                       - name: cpu
                         weight: 1
                       - name: memory
                         weight: 1
                   type: MostAllocated
             name: NodeResourcesFit
         plugins:
           score:
               enabled:
                   - name: NodeResourcesFit
                     weight: 1

Below is the pod to node spread I observed:

clustercheck-567c5c49f6-8g7c9   1/1     Running   0          26s     10.240.214.54    aks-subnet02-10595059-vmss000006   <none>           <none>
clustercheck-567c5c49f6-97b9p   1/1     Running   0          26s     10.240.214.9     aks-subnet02-10595059-vmss000006   <none>           <none>
clustercheck-567c5c49f6-n8n8v   1/1     Running   0          26s     10.240.214.90    aks-subnet02-10595059-vmss000009   <none>           <none>
clustercheck-567c5c49f6-njzdt   1/1     Running   0          26s     10.240.214.65    aks-subnet02-10595059-vmss000009   <none>           <none>
clustercheck-567c5c49f6-p6cfw   1/1     Running   0          5d18h   10.240.214.7     aks-subnet02-10595059-vmss000006   <none>           <none>
clustercheck-567c5c49f6-rh9zz   1/1     Running   0          26s     10.240.214.42    aks-subnet02-10595059-vmss000006   <none>           <none>
clustercheck-567c5c49f6-rqtrm   1/1     Running   0          26s     10.240.214.73    aks-subnet02-10595059-vmss000009   <none>           <none>
clustercheck-567c5c49f6-s84x4   1/1     Running   0          26s     10.240.214.5     aks-subnet02-10595059-vmss000006   <none>           <none>
clustercheck-567c5c49f6-vzcn9   1/1     Running   0          26s     10.240.214.110   aks-subnet02-10595059-vmss000007   <none>           <none>
clustercheck-567c5c49f6-x7ssx   1/1     Running   0          26s     10.240.214.8     aks-subnet02-10595059-vmss000006   <none>           <none>
webgoat-66d8d7cb57-7pl46        1/1     Running   0          5d18h   10.240.214.57    aks-subnet02-10595059-vmss000006   <none>           <none>

I reviewed the logs and I am seeing the plugin messages below. Please see if they shed some light on what is happening.

I1115 02:54:46.222071       1 framework.go:478] "MultiPoint plugin is explicitly re-configured; overriding" plugin="NodeResourcesFit"
I1115 02:54:46.224457       1 configfile.go:102] "Using component config" config=<
    apiVersion: kubescheduler.config.k8s.io/v1
    clientConnection:
      acceptContentTypes: ""
      burst: 100
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: ""
      qps: 50
    enableContentionProfiling: true
    enableProfiling: true
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
      leaseDuration: 15s
      renewDeadline: 10s
      resourceLock: leases
      resourceName: kube-scheduler
      resourceNamespace: kube-system
      retryPeriod: 2s
    parallelism: 16
    percentageOfNodesToScore: 0
    podInitialBackoffSeconds: 1
    podMaxBackoffSeconds: 10
    profiles:
    - pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: DefaultPreemptionArgs
          minCandidateNodesAbsolute: 100
          minCandidateNodesPercentage: 10
        name: DefaultPreemption
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          hardPodAffinityWeight: 1
          kind: InterPodAffinityArgs
        name: InterPodAffinity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: NodeAffinityArgs
        name: NodeAffinity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: NodeResourcesBalancedAllocationArgs
          resources:
          - name: cpu
            weight: 1
          - name: memory
            weight: 1
        name: NodeResourcesBalancedAllocation
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: NodeResourcesFitArgs
          scoringStrategy:
            resources:
            - name: cpu
              weight: 100
            - name: memory
              weight: 100
            type: MostAllocated
        name: NodeResourcesFit
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          defaultingType: System
          kind: PodTopologySpreadArgs
        name: PodTopologySpread
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          bindTimeoutSeconds: 600
          kind: VolumeBindingArgs
        name: VolumeBinding
      plugins:
        bind: {}
        filter: {}
        multiPoint:
          enabled:
          - name: PrioritySort
            weight: 0
          - name: NodeUnschedulable
            weight: 0
          - name: NodeName
            weight: 0
          - name: TaintToleration
            weight: 3
          - name: NodeAffinity
            weight: 2
          - name: NodePorts
            weight: 0
          - name: NodeResourcesFit
            weight: 1
          - name: VolumeRestrictions
            weight: 0
          - name: EBSLimits
            weight: 0
          - name: GCEPDLimits
            weight: 0
          - name: NodeVolumeLimits
            weight: 0
          - name: AzureDiskLimits
            weight: 0
          - name: VolumeBinding
            weight: 0
          - name: VolumeZone
            weight: 0
          - name: PodTopologySpread
            weight: 2
          - name: InterPodAffinity
            weight: 2
          - name: DefaultPreemption
            weight: 0
          - name: NodeResourcesBalancedAllocation
            weight: 1
          - name: ImageLocality
            weight: 1
          - name: DefaultBinder
            weight: 0
        permit: {}
        postBind: {}
        postFilter: {}
        preBind: {}
        preFilter: {}
        preScore: {}
        queueSort: {}
        reserve: {}
        score:
          enabled:
          - name: NodeResourcesFit
            weight: 100
      schedulerName: agys-scheduler

There is 1 answer

Answer by Ukri Niemimuukko:

The scheduler runs a node-scoring algorithm, and many plugins contribute to each node's final score. Each scoring plugin's result is multiplied by that plugin's weight, and the weighted results are added up per node.

Configuring NodeResourcesFit alone won't get you what you want. You would need to at least disable the PodTopologySpread plugin, or use weights to give NodeResourcesFit more impact. The point is that other plugins in play may affect the score more than the NodeResourcesFit plugin you configured, and some of them may do the exact opposite of what you want.

You have PodTopologySpread with weight 2 and NodeResourcesFit with weight 1. That is, NodeResourcesFit tries to bin pack with weight 1, while PodTopologySpread tries to do pretty much the opposite with weight 2. That won't work the way you expect. Adjust the weights or disable PodTopologySpread entirely.
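As a rough sketch of the "disable it" option, the profile below turns off PodTopologySpread at the score extension point and raises the weight of NodeResourcesFit. The scheduler name is taken from the question; the weight of 100 is an arbitrary illustrative value, and this fragment has not been tested against the cluster in question.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: agys-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            # Prefer nodes that are already the most utilized (bin packing).
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
    plugins:
      score:
        disabled:
          # Stop the spreading plugin from contributing to node scores.
          - name: PodTopologySpread
        enabled:
          # Illustrative weight; any value that dominates the other
          # scoring plugins serves the same purpose.
          - name: NodeResourcesFit
            weight: 100
```

PodTopologySpread is disabled only for scoring here, so any hard topologySpreadConstraints set on pods would still be applied at the filter stage. Other default scoring plugins remain active and can still pull the result away from strict bin packing.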

The log messages about NodeResourcesFit overriding your multiPoint configuration are normal given your config. The specific extension points take precedence over multiPoint. Your config looks fine in that sense.

If you increase the scheduler's log verbosity to a very high value (10+), you can observe the scoring algorithm in action. With that much output, grep is your friend.
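One way to raise the verbosity, assuming the custom scheduler runs as a Deployment that starts the kube-scheduler binary directly, is to add the `--v` flag to the container command. The container name, image, and config path below are placeholders, not values from the question.

```yaml
# Fragment of the scheduler Deployment's pod template (spec.template.spec).
containers:
  - name: agys-scheduler                       # placeholder name
    image: registry.k8s.io/kube-scheduler:<version>   # placeholder image tag
    command:
      - kube-scheduler
      - --config=/etc/kubernetes/scheduler-config.yaml  # placeholder path
      - --v=10                                 # very verbose logging
```

Logging at this level is very noisy, so it is best enabled only for the duration of a test.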