I followed this article (https://cast.ai/blog/custom-kube-scheduler-why-and-how-to-set-it-up-in-kubernetes/) to deploy a custom scheduler on AKS, and I used OPA Gatekeeper to automatically mutate the schedulerName of all pods to point at the custom scheduler. The idea behind deploying the custom scheduler is to see whether we can achieve efficient bin packing and use the descheduler to remove under-utilized nodes, keeping the overall cost of our AKS clusters to a minimum.
The deployment of the scheduler and the OPA Gatekeeper piece all seem to work, and the pods are picked up by the custom scheduler, but when I ran a quick test it appears that the custom scheduler is not behaving as I expect. Below is my test case; a simplified version of the Gatekeeper mutation follows it.
- I have an AKS cluster running 3 nodes which are almost empty.
- I have a deployment named clustercheck that requests just 10m CPU.
- I edited the deployment to run 10 replicas and noticed that the replicas were distributed roughly equally across all nodes, whereas I expected them all to land on one node since I am using the MostAllocated scoring strategy in the scheduler.
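For reference, the Gatekeeper piece is essentially an Assign mutation along these lines (simplified, with an illustrative name):
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: set-custom-scheduler            # illustrative name
spec:
  applyTo:
  - groups: [""]
    versions: ["v1"]
    kinds: ["Pod"]
  match:
    scope: Namespaced
    kinds:
    - apiGroups: ["*"]
      kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]  # leave system pods on the default scheduler
  location: "spec.schedulerName"
  parameters:
    assign:
      value: agys-scheduler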
Below is the configuration of the scheduler:
- schedulerName: agys-scheduler
  pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
Below is the pod-to-node spread I observed:
clustercheck-567c5c49f6-8g7c9 1/1 Running 0 26s 10.240.214.54 aks-subnet02-10595059-vmss000006 <none> <none>
clustercheck-567c5c49f6-97b9p 1/1 Running 0 26s 10.240.214.9 aks-subnet02-10595059-vmss000006 <none> <none>
clustercheck-567c5c49f6-n8n8v 1/1 Running 0 26s 10.240.214.90 aks-subnet02-10595059-vmss000009 <none> <none>
clustercheck-567c5c49f6-njzdt 1/1 Running 0 26s 10.240.214.65 aks-subnet02-10595059-vmss000009 <none> <none>
clustercheck-567c5c49f6-p6cfw 1/1 Running 0 5d18h 10.240.214.7 aks-subnet02-10595059-vmss000006 <none> <none>
clustercheck-567c5c49f6-rh9zz 1/1 Running 0 26s 10.240.214.42 aks-subnet02-10595059-vmss000006 <none> <none>
clustercheck-567c5c49f6-rqtrm 1/1 Running 0 26s 10.240.214.73 aks-subnet02-10595059-vmss000009 <none> <none>
clustercheck-567c5c49f6-s84x4 1/1 Running 0 26s 10.240.214.5 aks-subnet02-10595059-vmss000006 <none> <none>
clustercheck-567c5c49f6-vzcn9 1/1 Running 0 26s 10.240.214.110 aks-subnet02-10595059-vmss000007 <none> <none>
clustercheck-567c5c49f6-x7ssx 1/1 Running 0 26s 10.240.214.8 aks-subnet02-10595059-vmss000006 <none> <none>
webgoat-66d8d7cb57-7pl46 1/1 Running 0 5d18h 10.240.214.57 aks-subnet02-10595059-vmss000006 <none> <none>
I reviewed the logs and am seeing the plugin warning below, along with the effective configuration. Please see if this sheds any light on what's happening.
I1115 02:54:46.222071 1 framework.go:478] "MultiPoint plugin is explicitly re-configured; overriding" plugin="NodeResourcesFit"
I1115 02:54:46.224457 1 configfile.go:102] "Using component config" config=<
apiVersion: kubescheduler.config.k8s.io/v1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: ""
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
  leaseDuration: 15s
  renewDeadline: 10s
  resourceLock: leases
  resourceName: kube-scheduler
  resourceNamespace: kube-system
  retryPeriod: 2s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesBalancedAllocationArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesBalancedAllocation
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 100
        - name: memory
          weight: 100
        type: MostAllocated
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  plugins:
    bind: {}
    filter: {}
    multiPoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
    permit: {}
    postBind: {}
    postFilter: {}
    preBind: {}
    preFilter: {}
    preScore: {}
    queueSort: {}
    reserve: {}
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 100
  schedulerName: agys-scheduler
The scheduler runs a scoring phase in which every enabled score plugin scores each feasible node, and a node's final score is the weighted sum of those plugin scores.
Configuring NodeResourcesFit alone won't get you what you want. You would need to disable at least the PodTopologySpread plugin, or use weights so that NodeResourcesFit has more impact. The point is that other plugins are also in play and may have a more significant effect on the score than the NodeResourcesFit plugin you configured, and some of them do the exact opposite of what you want.
You have PodTopologySpread with weight 2 and NodeResourcesFit with weight 1. That is, NodeResourcesFit tries to bin-pack with weight 1 while PodTopologySpread tries to do pretty much the opposite with weight 2, so it won't work the way you expected. Adjust the weights or disable PodTopologySpread entirely, for example along the lines of the profile sketched below.
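Something like this in your profile should tilt the score toward bin packing. It is an untested sketch based on the config you posted (keep whichever args apiVersion matches your scheduler version), not a drop-in file:
- schedulerName: agys-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: NodeResourcesFitArgs
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 100                               # give bin packing a much larger share of the final score
      disabled:
      - name: PodTopologySpread                   # stop spreading replicas across nodes at scoring time
      - name: NodeResourcesBalancedAllocation     # balanced allocation also works against MostAllocated
Note that disabling a plugin at the score extension point only removes its scoring contribution; its filter behaviour (where it has one) stays intact.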
The log line about NodeResourcesFit overriding your multi-point configuration is normal given your config: the specific extension points take precedence over multiPoint. Your config looks fine in that sense.
If you increase the scheduler's log verbosity to a very high value (10+), you can observe the scoring algorithm in action; grep is your friend at that verbosity. A sketch of where the flag goes is below.
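Assuming the custom scheduler runs as a Deployment (the container name, image, and config path here are illustrative), the verbosity flag goes on the kube-scheduler container:
containers:
- name: agys-scheduler                                    # illustrative name
  image: registry.k8s.io/kube-scheduler:v1.26.0           # illustrative image/tag
  command:
  - kube-scheduler
  - --config=/etc/kubernetes/agys-scheduler-config.yaml   # illustrative path to the profile above
  - --v=10                                                # very verbose; logs per-plugin node scores you can grep for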