---
layout: blog
title: "Kubernetes v1.29: QueueingHint Brings a New Possibility to Optimize Pod Scheduling"
date: 2023-12-19T00:00:00-08:00
slug: scheduler-queueinghint
---

**Author:** [Kensei Nakada](https://github.com/sanposhiho) (Mercari)

The Kubernetes [scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/) is the core
component that decides which Node each new Pod should run on.
It schedules Pods **one by one**,
so the larger your cluster is, the more important the scheduler's throughput becomes.

Scheduler throughput has been a long-standing challenge for the Kubernetes project,
and over the years SIG Scheduling has improved it through many enhancements.

In this blog post, I'll introduce a recent major improvement in the scheduler: a new
[scheduling context element](/docs/concepts/scheduling-eviction/scheduling-framework/#extension-points)
named _QueueingHint_.
We'll walk through the background knowledge you need about the scheduler,
and then how QueueingHint improves scheduling throughput.

## Scheduling queue

The scheduler stores all unscheduled Pods in an internal component that we - SIG Scheduling -
call the _scheduling queue_.

The scheduling queue is composed of three data structures: _ActiveQ_, _BackoffQ_ and _Unschedulable Pod Pool_ (sketched in Go below).
- ActiveQ: It holds newly created Pods, and Pods that are ready to be retried for scheduling.
- BackoffQ: It holds Pods that are ready to be retried but are still waiting out a backoff period, which depends on the number of times the scheduler attempted to schedule the Pod.
- Unschedulable Pod Pool: It holds Pods that should not be scheduled for now, because they have a scheduling gate or because the scheduler attempted to schedule them and nothing has changed in the cluster that could make them schedulable.
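To make this concrete, here is a conceptual sketch of the queue in Go. The type and field names are illustrative only; the real implementation lives in the scheduler's internal queue package and uses heaps with much more bookkeeping.

```go
package queue

import v1 "k8s.io/api/core/v1"

// SchedulingQueue is a conceptual sketch, not the real implementation.
type SchedulingQueue struct {
	activeQ  []*v1.Pod // Pods ready for a scheduling attempt right now
	backoffQ []*v1.Pod // Pods ready to be retried, still waiting out their backoff period

	// Pods that shouldn't be retried until something relevant changes in the
	// cluster (or until their scheduling gates are removed), keyed by Pod name,
	// together with the plugins that rejected them (more on that below).
	unschedulablePods map[string]unschedulableRecord
}

type unschedulableRecord struct {
	pod              *v1.Pod
	rejectingPlugins []string
}
```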

## Scheduling framework and plugins

The Kubernetes scheduler is implemented following the Kubernetes
[scheduling framework](/docs/concepts/scheduling-eviction/scheduling-framework/).

Each scheduling requirement is implemented as a plugin.
(For example, [Pod affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
is implemented in the `PodAffinity` plugin.)
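For a flavor of what a plugin looks like, here is a minimal, hypothetical Filter plugin; the `example.com/ready` label is made up for illustration.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ReadyLabelFilter is a hypothetical plugin: it rejects any Node for the Pod
// unless the Node carries the (made-up) example.com/ready label.
type ReadyLabelFilter struct{}

func (p *ReadyLabelFilter) Name() string { return "ReadyLabelFilter" }

func (p *ReadyLabelFilter) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if _, ok := nodeInfo.Node().Labels["example.com/ready"]; !ok {
		return framework.NewStatus(framework.Unschedulable, "node doesn't have the example.com/ready label")
	}
	return nil // a nil status means the Node passes this Filter
}
```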

Scheduling a Pod is split into two phases. The first, called the _scheduling cycle_, takes Pods from ActiveQ **one by one**, runs all plugins' logic,
and finally decides which Node the Pod should run on, or concludes that the Pod cannot go anywhere for now.

If the scheduling is successful, the second phase, called the _binding cycle_, binds the Pod with
the Node by communicating the decision to the API server.
But if it turns out during the scheduling cycle that the Pod cannot go anywhere,
the binding cycle isn't executed; instead, the Pod is moved back to the scheduling queue.
With some exceptions, such unscheduled Pods enter the Unschedulable Pod Pool.
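Putting the two phases together, one scheduling iteration looks roughly like the following, continuing the conceptual `SchedulingQueue` sketch from above. `pickNode` and `bind` are hypothetical stand-ins for the real framework machinery.

```go
// pickNode stands in for the scheduling cycle's plugin runs: it returns the
// chosen Node's name, or "" plus the names of the plugins that rejected the Pod.
func pickNode(pod *v1.Pod) (node string, rejectingPlugins []string) {
	return "", []string{"PodAffinity"} // placeholder
}

// bind stands in for the binding cycle: it communicates the decision
// to the API server.
func bind(pod *v1.Pod, node string) {}

func (q *SchedulingQueue) scheduleOne() {
	// The scheduling cycle takes Pods from ActiveQ one by one.
	pod := q.activeQ[0]
	q.activeQ = q.activeQ[1:]

	node, rejectingPlugins := pickNode(pod)
	if node == "" {
		// No binding cycle; with some exceptions, the Pod is parked in the
		// Unschedulable Pod Pool, tagged with the plugins that rejected it.
		q.unschedulablePods[pod.Name] = unschedulableRecord{pod, rejectingPlugins}
		return
	}
	go bind(pod, node) // the binding cycle runs without blocking the next Pod
}
```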

Pods in the Unschedulable Pod Pool are moved back to ActiveQ or BackoffQ
only when the scheduling queue detects a change in the cluster that might make them schedulable if scheduling is retried.

This gating is crucial because the scheduling cycle handles Pods one by one:
if we didn't have the Unschedulable Pod Pool and simply kept retrying every Pod,
scheduling cycles would be wasted on Pods that have no chance of being scheduled.

So, how does the scheduling queue decide when to move a Pod back into ActiveQ? How does it notice that a Pod might now be schedulable?
This is where QueueingHints come into play.

## QueueingHint

A QueueingHint is a per-plugin callback function for noticing object additions, updates, and deletions in the cluster (we call these _cluster events_)
that may make Pods schedulable.
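In code, a QueueingHint is a function the plugin registers alongside the cluster events it subscribes to. The core types look roughly like the following simplified excerpt of the scheduler framework around v1.29 (trimmed for readability; check the framework package for the authoritative definitions):

```go
package framework

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// QueueingHint tells the scheduling queue whether a cluster event is worth
// retrying a rejected Pod for.
type QueueingHint int

const (
	// QueueSkip: this event cannot make the Pod schedulable; leave it parked.
	QueueSkip QueueingHint = iota
	// Queue: this event may make the Pod schedulable; requeue it.
	Queue
)

// QueueingHintFn is called with the object before and after the change;
// oldObj is nil for additions, and newObj is nil for deletions.
type QueueingHintFn func(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (QueueingHint, error)
```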

Let's say the Pod `pod-a` has a required Pod affinity, and was rejected in the scheduling cycle by the `PodAffinity` plugin
because no Node had an existing Pod matching `pod-a`'s Pod affinity specification.

![pod-a got rejected by PodAffinity](./queueinghint1.svg)

When an unscheduled Pod is put into the Unschedulable Pod Pool, the scheduling queue
records which plugins caused the scheduling failure of the Pod.
In this example, the scheduling queue notes that `pod-a` was rejected by `PodAffinity`.

`pod-a` can never become schedulable until the PodAffinity failure is resolved somehow.
So the scheduling queue consults the QueueingHints of the plugins that rejected the Pod, which is just `PodAffinity` in this example.

A QueueingHint subscribes to a particular kind of cluster event and decides whether an incoming event could make the Pod schedulable.
Thinking about when a PodAffinity failure could be resolved,
one possible scenario is that an existing Pod gets a new label that matches `pod-a`'s Pod affinity.

The `PodAffinity` plugin's QueueingHint callback checks every Pod update happening in the cluster,
and when it catches such an update, the scheduling queue moves `pod-a` to ActiveQ or BackoffQ.

![pod-a is moved by PodAffinity QueueingHint](./queueinghint2.svg)
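As a sketch, a hint function for this scenario could look like the following. This is a simplified, hypothetical version; the real `PodAffinity` QueueingHint handles more cases (deletions, namespaces, anti-affinity, and so on).

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// isSchedulableAfterPodChange is a simplified, hypothetical QueueingHint for
// a Pod that was rejected because of required Pod affinity.
func isSchedulableAfterPodChange(logger klog.Logger, pod *v1.Pod,
	oldObj, newObj interface{}) (framework.QueueingHint, error) {
	newPod, ok := newObj.(*v1.Pod)
	if !ok {
		// Not a Pod change we can reason about; requeue to stay on the safe side.
		return framework.Queue, nil
	}
	if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAffinity == nil {
		return framework.QueueSkip, nil
	}
	// If the updated Pod now matches one of the rejected Pod's required
	// affinity terms, retrying the scheduling may succeed.
	for _, term := range pod.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
		selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
		if err != nil {
			return framework.Queue, err
		}
		if selector.Matches(labels.Set(newPod.Labels)) {
			logger.V(5).Info("update may make the Pod schedulable", "pod", pod.Name)
			return framework.Queue, nil
		}
	}
	// An irrelevant update; skipping it avoids a wasted scheduling cycle.
	return framework.QueueSkip, nil
}
```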

The scheduler actually already had similar functionality (called `preCheck`) inside the scheduling queue,
which filters out cluster events based on Kubernetes core scheduling constraints -
for example, filtering out Node-related events when the Node isn't ready.

But `preCheck` isn't ideal, because it hard-codes the in-tree plugins' logic,
which causes issues for custom plugins (for example: [#110175](https://github.com/kubernetes/kubernetes/issues/110175)).

## What's new in v1.29

Within SIG Scheduling, we have been developing QueueingHint since
Kubernetes v1.28.
In v1.28, only one alpha plugin (DRA) supported QueueingHint,
and in v1.29, some stable plugins started to implement QueueingHints.

QueueingHint is not user-facing, but we guard it with a feature gate (`SchedulerQueueingHints`) as a safety net,
because QueueingHint changes a critical path of the scheduler and adds some memory overhead, depending on how busy the cluster is.

## Getting involved

These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).

Please join us and share your feedback.

## How can I learn more?

- [KEP-4247: Per-plugin callback functions for efficient requeueing in the scheduling queue](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4247-queueinghint/README.md)