diff --git a/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/index.md b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/index.md
new file mode 100644
index 0000000000000..7a5aa72d20fa3
--- /dev/null
+++ b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/index.md
@@ -0,0 +1,113 @@
+---
+layout: blog
+title: "Kubernetes v1.29: QueueingHint Brings a New Possibility to Optimize Pod Scheduling"
+date: 2023-12-19T00:00:00-08:00
+slug: scheduler-queueinghint
+---
+
+**Author:** [Kensei Nakada](https://github.com/sanposhiho) (Mercari)
+
+The Kubernetes [scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/) is the core
+component that decides which node any new Pods should run on.
+Basically, it schedules Pods **one by one**,
+and thus the larger your cluster is, the more crucial the throughput of the scheduler is.
+
+For the Kubernetes project, the throughput of the scheduler has been a long-standing challenge.
+Over the years, SIG Scheduling has put effort into improving scheduling throughput through many enhancements.
+
+In this blog post, I'll introduce a recent major improvement in the scheduler: a new
+[scheduling context element](/docs/concepts/scheduling-eviction/scheduling-framework/#extension-points)
+named _QueueingHint_.
+We'll go through the basic background of the scheduler,
+and how QueueingHint improves our scheduling throughput.
+
+## Scheduling queue
+
+The scheduler stores all unscheduled Pods in an internal component that we - SIG Scheduling -
+call the _scheduling queue_.
+
+The scheduling queue is composed of three data structures: _ActiveQ_, _BackoffQ_ and _Unschedulable Pod Pool_.
+- ActiveQ: It holds newly created Pods or Pods which are ready to be retried for scheduling.
+- BackoffQ: It holds Pods which are ready to be retried, but are waiting for a backoff period, which depends on the number of times the scheduler attempted to schedule the Pod.
+- Unschedulable Pod Pool: It holds Pods which should not be scheduled for now, because they have a Scheduling Gate or because the scheduler attempted to schedule them and nothing has changed in the cluster that could make the Pod schedulable.
+
+## Scheduling framework and plugins
+
+The Kubernetes scheduler is implemented following the Kubernetes
+[scheduling framework](/docs/concepts/scheduling-eviction/scheduling-framework/).
+
+Each scheduling requirement is implemented as a plugin.
+(e.g., [Pod affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
+is implemented in the `PodAffinity` plugin.)
+
+The scheduler handles each Pod in two phases. The first phase, called the _scheduling cycle_, takes Pods from ActiveQ **one by one**, runs all plugins' logic,
+and lastly decides on which Node to run the Pod, or concludes that the Pod cannot go anywhere for now.
+
+If the scheduling is successful, the second phase, called the _binding cycle_, binds the Pod with
+the Node by communicating the decision to the API server.
+But, if it turns out during the scheduling cycle that the Pod cannot go anywhere,
+the binding cycle isn't executed; instead the Pod is moved back to the scheduling queue.
+Although there are some exceptions, unscheduled Pods enter the _unschedulable pod pool_.
+
+Pods in the Unschedulable Pod Pool are moved to ActiveQ/BackoffQ
+only when the scheduling queue identifies changes in the cluster that might make those Pods schedulable if we retry the scheduling.
+
+That is a crucial step because the scheduling cycle is performed for Pods one by one -
+if we didn't have the Unschedulable Pod Pool and kept retrying the scheduling of any Pods,
+multiple scheduling cycles would be wasted on Pods that have no chance to be scheduled.
+
+Then, how does the scheduling queue decide when to move a Pod back into the ActiveQ? How does it notice that Pods might be schedulable now?
+Here QueueingHints come into play.
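As a rough illustration of the three structures described above, here is a toy model in Python. It is not the scheduler's code: the class, the method names, and the `might_help` predicate are hypothetical, and the exponential, capped backoff is an assumed policy chosen only so that the backoff depends on the number of attempts.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    attempts: int = 0          # scheduling attempts so far
    last_attempt: float = 0.0  # time of the latest attempt, in seconds


def backoff_seconds(attempts, initial=1.0, maximum=10.0):
    # Assumed policy: doubles with each attempt, capped at `maximum`.
    return min(initial * 2 ** (attempts - 1), maximum)


@dataclass
class SchedulingQueue:
    active_q: deque = field(default_factory=deque)
    backoff_q: list = field(default_factory=list)           # (ready_at, pod) pairs
    unschedulable_pool: dict = field(default_factory=dict)  # name -> pod

    def add(self, pod):
        # Newly created Pods go straight to ActiveQ.
        self.active_q.append(pod)

    def pop(self, now):
        # The scheduling cycle takes Pods from ActiveQ one by one.
        pod = self.active_q.popleft()
        pod.attempts += 1
        pod.last_attempt = now
        return pod

    def mark_unschedulable(self, pod):
        # A Pod that could not be placed waits here instead of being retried blindly.
        self.unschedulable_pool[pod.name] = pod

    def on_cluster_change(self, might_help, now):
        # Move only the Pods that the change might make schedulable.
        moved = []
        for name, pod in list(self.unschedulable_pool.items()):
            if not might_help(pod):
                continue
            del self.unschedulable_pool[name]
            ready_at = pod.last_attempt + backoff_seconds(pod.attempts)
            if ready_at <= now:
                self.active_q.append(pod)
            else:
                self.backoff_q.append((ready_at, pod))
            moved.append(name)
        return moved

    def flush_backoff(self, now):
        # Pods whose backoff period has elapsed become ready for retry.
        still_waiting = []
        for ready_at, pod in self.backoff_q:
            if ready_at <= now:
                self.active_q.append(pod)
            else:
                still_waiting.append((ready_at, pod))
        self.backoff_q = still_waiting
```

In this model a Pod leaves the pool only when `might_help` says a cluster change is relevant to it, so no scheduling cycle is spent on Pods that nothing has changed for.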
+
+## QueueingHint
+
+QueueingHint is a per-plugin callback function that notices an object addition/update/deletion in the cluster (we call them cluster events)
+that may make Pods schedulable.
+
+Let's say the Pod `pod-a` has a required Pod affinity, and got rejected in the scheduling cycle by the `PodAffinity` plugin
+because no Node has any Pod matching the Pod affinity specification for `pod-a`.
+
+![pod-a got rejected by PodAffinity](./queueinghint1.svg)
+
+When an unscheduled Pod is put into the unschedulable pod pool, the scheduling queue
+records which plugins caused the scheduling failure of the Pod.
+In this example, the scheduling queue notes that `pod-a` was rejected by `PodAffinity`.
+
+`pod-a` will never be schedulable until the PodAffinity failure is resolved somehow.
+The scheduling queue uses the queueing hints from the plugins that rejected the Pod, which is `PodAffinity` in the example.
+
+A QueueingHint subscribes to a particular kind of cluster event and decides whether an incoming event could make the Pod schedulable.
+Thinking about when the PodAffinity failure could be resolved,
+one possible scenario is that an existing Pod gets a new label which matches `pod-a`'s PodAffinity.
+
+The `PodAffinity` plugin's `QueueingHint` callback checks all Pod updates happening in the cluster,
+and when it catches such an update, the scheduling queue moves `pod-a` to either ActiveQ or BackoffQ.
+
+![pod-a is moved by PodAffinity QueueingHint](./queueinghint2.svg)
+
+We actually already had similar functionality (called `preCheck`) inside the scheduling queue,
+which filters out cluster events based on Kubernetes core scheduling constraints -
+for example, filtering out node related events when nodes aren't ready.
+
+But, it's not ideal because this hard-coded `preCheck` refers to in-tree plugins' logic,
+and it causes issues for custom plugins (for example: [#110175](https://github.com/kubernetes/kubernetes/issues/110175)).
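The `pod-a` walkthrough can be sketched conceptually as follows. This is a simplified Python model, not the scheduler's Go API: the `Hint` values, the `HINTS` registry, the event kind strings, and the dict-based Pod representation are all hypothetical names introduced for illustration, and label matching is reduced to exact key/value equality.

```python
from enum import Enum


class Hint(Enum):
    QUEUE = "Queue"  # the event may make the Pod schedulable
    SKIP = "Skip"    # the event cannot help this Pod


def pod_affinity_hint(pending_pod, old_labels, new_labels):
    """Hypothetical hint for a Pod update event, modelled on the pod-a example."""
    wanted = pending_pod["required_affinity"]  # label selector as a plain dict
    matched_before = all(old_labels.get(k) == v for k, v in wanted.items())
    matches_now = all(new_labels.get(k) == v for k, v in wanted.items())
    # Only an update that newly satisfies the affinity is worth a retry.
    return Hint.QUEUE if matches_now and not matched_before else Hint.SKIP


# Each plugin subscribes its hint to a particular kind of cluster event.
HINTS = {("PodAffinity", "Pod/Update"): pod_affinity_hint}


def pods_to_requeue(unschedulable, event_kind, old_labels, new_labels):
    """Return names of Pods that should leave the unschedulable pod pool."""
    requeue = []
    for pod in unschedulable:
        # Consult only the plugins recorded as having rejected this Pod.
        for plugin in pod["failed_by"]:
            hint_fn = HINTS.get((plugin, event_kind))
            if hint_fn and hint_fn(pod, old_labels, new_labels) is Hint.QUEUE:
                requeue.append(pod["name"])
                break
    return requeue
```

For example, with `pod-a` recorded as rejected by `PodAffinity` and requiring the label `app: db`, an update that adds `app: db` to an existing Pod returns `["pod-a"]`, while an update that adds an unrelated label returns an empty list and `pod-a` stays in the pool.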
+
+## What's new in v1.29
+
+Within SIG Scheduling, we have been working on the development of QueueingHint since
+Kubernetes v1.28.
+In v1.28, only one alpha plugin (DRA) supported QueueingHint,
+and in v1.29, some stable plugins started to implement QueueingHints.
+
+QueueingHint is not something user-facing, but we have a feature gate (`SchedulerQueueingHints`) as a safety net
+because QueueingHint changes a critical path of the scheduler and adds some memory overhead, depending on how busy a cluster is.
+
+## Getting involved
+
+These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).
+
+Please join us and share your feedback.
+
+## How can I learn more?
+
+- [KEP-4247: Per-plugin callback functions for efficient requeueing in the scheduling queue](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4247-queueinghint/README.md)
\ No newline at end of file
diff --git a/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint1.svg b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint1.svg
new file mode 100644
index 0000000000000..d2dd2a6cb9027
--- /dev/null
+++ b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint1.svg
@@ -0,0 +1,4 @@
+
+
+
+
[queueinghint1.svg, text labels only: Scheduling Queue; unsched (FailedBy: PodAffinity); Queueing Hint; EventHandler; PreEnqueue; ActiveQ; BackoffQ; Cluster events; PodAffinity: "I'm in charge of requeueing pod-a"; Go to the scheduling cycle]
\ No newline at end of file
diff --git a/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint2.svg b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint2.svg
new file mode 100644
index 0000000000000..9f7c65682f3ad
--- /dev/null
+++ b/content/en/blog/_posts/2023-12-19-scheduler-queueinghint/queueinghint2.svg
@@ -0,0 +1,4 @@
+
+
+
+
[queueinghint2.svg, text labels only: Scheduling Queue; unsched (FailedBy: PodAffinity); Queueing Hint; EventHandler; PreEnqueue; ActiveQ; BackoffQ; PodUpdated; PodAffinity: "Oh, this event shows that an existing Pod gets a new label which matches with PodA's PodAffinity!"; Go to the scheduling cycle]
\ No newline at end of file