Add a blog post about QueueingHint #43686

Closed
wants to merge 8 commits into from
99 changes: 99 additions & 0 deletions content/en/blog/_posts/2023-11-xx-scheduler-queueinghint/index.md
@@ -0,0 +1,99 @@
---
layout: blog
title: "Kubernetes v1.28: QueueingHint brings a new possibility to optimize our scheduling"
Contributor

1.29 now, please. Also - until we change the style guide: Title Case.

Member Author

Good to know.

Contributor

I've marked this conversation as “not resolved” because the feedback is still applicable.

date: 2023-10-25T10:00:00-08:00
slug: scheduler-queueinghint
---

**Author:** [Kensei Nakada](https://github.com/sanposhiho) (Mercari)

The scheduler is the core component that decides which Node each Pod runs on.
It schedules Pods **one by one**,
so the larger your cluster is, the more the scheduler's throughput matters.

The throughput of the scheduler is our eternal challenge.
Over the years, SIG Scheduling has put a lot of effort into improving scheduling throughput through many enhancements.

In this blog post, I'll introduce a recent major improvement in the scheduler, named QueueingHint.

We'll first cover some background on how the scheduler works,
and then look at how QueueingHint improves scheduling throughput.

## Scheduling Queue

The scheduler keeps all unscheduled Pods in the Scheduling Queue.

The Scheduling Queue is composed of three parts: ActiveQ, BackoffQ, and the Unschedulable Pod Pool.
- ActiveQ: Pods that are ready to be scheduled.
- BackoffQ: Pods that are waiting for their backoff period to end, after which they are moved to ActiveQ.
- Unschedulable Pod Pool: Pods that should not be scheduled for now.
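
To make the three parts concrete, here is a minimal toy model in Python. It is only a sketch and not the scheduler's actual implementation (which is written in Go): the class name, method names, and the plain FIFO ordering of ActiveQ are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class SchedulingQueue:
    """Toy model of the Scheduling Queue; every name here is illustrative."""
    active_q: list = field(default_factory=list)   # Pods ready to be scheduled (plain FIFO in this sketch)
    backoff_q: list = field(default_factory=list)  # min-heap of (ready_at, pod_name)
    unschedulable_pods: dict = field(default_factory=dict)  # pod_name -> plugins that rejected it

    def add(self, pod):
        self.active_q.append(pod)

    def add_backoff(self, pod, ready_at):
        heapq.heappush(self.backoff_q, (ready_at, pod))

    def add_unschedulable(self, pod, rejected_by):
        self.unschedulable_pods[pod] = set(rejected_by)

    def flush_backoff(self, now):
        """Move Pods whose backoff has expired into ActiveQ and return them."""
        moved = []
        while self.backoff_q and self.backoff_q[0][0] <= now:
            _, pod = heapq.heappop(self.backoff_q)
            self.active_q.append(pod)
            moved.append(pod)
        return moved


queue = SchedulingQueue()
queue.add("pod-a")
queue.add_backoff("pod-b", ready_at=5.0)
queue.add_unschedulable("pod-c", rejected_by={"PodAffinity"})
queue.flush_backoff(now=10.0)
print(queue.active_q)  # ['pod-a', 'pod-b']
```

Note that `pod-c` stays in the pool: nothing in this sketch moves it out, which is exactly the gap the rest of this post is about.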

## Scheduling Framework and Plugins

The scheduler is built on the [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework/),
and each scheduling requirement or preference is implemented as a plugin.
For example, PodAffinity is implemented in the PodAffinity plugin.

The first phase, called the Scheduling Cycle, takes Pods from ActiveQ **one by one**, gathers the results from all plugins,
and finally either decides on a Node to run the Pod or concludes that the Pod cannot be placed anywhere for now.

If the scheduling is successful, the second phase, called the Binding Cycle, binds the Pod to the Node.
But if the Scheduling Cycle finds that the Pod cannot be placed anywhere,
the Binding Cycle isn't executed; instead, the Pod is moved back to the Scheduling Queue.
With some exceptions, such an unscheduled Pod is put into the Unschedulable Pod Pool.

Pods in the Unschedulable Pod Pool are moved to ActiveQ/BackoffQ
only when the Scheduling Queue thinks they might be schedulable if scheduling is retried.

This is a crucial step because the Scheduling Cycle handles Pods one by one:
if we didn't have the Unschedulable Pod Pool and kept retrying every Pod,
Scheduling Cycles would be wasted on Pods that have no hope of being scheduled.
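
The flow described in this section can be sketched roughly as follows. This is a simplified illustration, not the real scheduler: the two filter functions, the dictionary shapes for Pods and Nodes, and the `run` helper are all hypothetical, and the Binding Cycle is reduced to recording the chosen Node.

```python
from collections import deque


def fits_resources(pod, node):
    # Hypothetical filter: the Node must have enough free CPU for the Pod.
    return node["free_cpu"] >= pod["cpu"]


def has_affinity_peer(pod, node):
    # Hypothetical filter: a required affinity label must already exist on the Node.
    wanted = pod.get("affinity")
    return wanted is None or wanted in node["pod_labels"]


PLUGINS = {"NodeResourcesFit": fits_resources, "PodAffinity": has_affinity_peer}


def scheduling_cycle(pod, nodes):
    """Return (node_name, failed_plugins); node_name is None when no Node fits."""
    failed_plugins = set()
    for node in nodes:
        rejected_by = [name for name, plugin in PLUGINS.items() if not plugin(pod, node)]
        if not rejected_by:
            return node["name"], set()
        failed_plugins.update(rejected_by)
    return None, failed_plugins


def run(active_q, nodes):
    """Drain active_q. Node state is not updated after binding in this sketch."""
    bindings, unschedulable = {}, {}
    while active_q:
        pod = active_q.popleft()  # Pods are taken one by one.
        node_name, failed = scheduling_cycle(pod, nodes)
        if node_name is not None:
            bindings[pod["name"]] = node_name    # stand-in for the Binding Cycle
        else:
            unschedulable[pod["name"]] = failed  # remember which plugins rejected it
    return bindings, unschedulable


nodes = [{"name": "node-1", "free_cpu": 2, "pod_labels": {"app=web"}}]
pods = deque([
    {"name": "pod-a", "cpu": 1, "affinity": "app=db"},
    {"name": "pod-b", "cpu": 1},
])
bindings, unschedulable = run(pods, nodes)
print(bindings)       # {'pod-b': 'node-1'}
print(unschedulable)  # {'pod-a': {'PodAffinity'}}
```

If `pod-a` were simply pushed back into the queue here, every later pass would spend a Scheduling Cycle on it and reject it again for the same reason.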

So how does the Scheduling Queue decide when to move them? How does it notice that Pods might be schedulable now?
This is where QueueingHint comes in.

## QueueingHint

QueueingHints are per-plugin callback functions that watch for object additions, updates, and deletions in the cluster (we call these cluster events)
that may make Pods schedulable.

Let's say PodA has a required PodAffinity and was rejected in the Scheduling Cycle by the PodAffinity plugin
because no Node has any Pod matching PodA's PodAffinity.

![PodA got rejected by PodAffinity](./queueinghint1.png)

When an unscheduled Pod is put into the Unschedulable Pod Pool, the Scheduling Queue remembers which plugins caused the Pod's scheduling failure.
In this example, the Scheduling Queue notes that PodA was rejected by PodAffinity.

PodA will never be schedulable until the PodAffinity failure is resolved somehow.
The Scheduling Queue uses the QueueingHints of the plugins that caused the failure, which is PodAffinity in this example.

A QueueingHint subscribes to a particular cluster event and decides whether an incoming event could make the Pod schedulable.
Thinking about when the PodAffinity failure could be resolved,
one possible scenario is that an existing Pod gets a new label that matches PodA's PodAffinity.

The PodAffinity QueueingHint checks all Pod updates happening in the cluster,
and when it catches such an update, the Scheduling Queue moves PodA to ActiveQ/BackoffQ.

![PodA is moved by PodAffinity QueueingHint](./queueinghint2.png)
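
The PodA example can be sketched as a small callback. This is a hedged illustration of the idea, not the scheduler's Go API: the `Hint` enum, the event dictionary shape, and the function names are assumptions, and the sketch always requeues to ActiveQ without modeling backoff.

```python
from enum import Enum


class Hint(Enum):
    QUEUE = "Queue"           # the event may make the Pod schedulable
    QUEUE_SKIP = "QueueSkip"  # the event cannot help this Pod


def pod_affinity_hint(pending_pod, event):
    """Hypothetical hint for the PodAffinity plugin: only Pod adds/updates whose
    labels satisfy the pending Pod's required affinity are worth a retry."""
    if event["resource"] != "Pod" or event["action"] not in ("Add", "Update"):
        return Hint.QUEUE_SKIP
    if pending_pod["affinity"] in event["new_labels"]:
        return Hint.QUEUE
    return Hint.QUEUE_SKIP


HINTS = {"PodAffinity": pod_affinity_hint}


def on_cluster_event(unschedulable, active_q, event):
    """Requeue Pods when a hint from one of their failed plugins returns QUEUE."""
    for name in list(unschedulable):
        pod, failed_plugins = unschedulable[name]
        hints = (HINTS[p](pod, event) for p in failed_plugins if p in HINTS)
        if any(h is Hint.QUEUE for h in hints):
            del unschedulable[name]
            active_q.append(name)


pool = {"pod-a": ({"affinity": "app=db"}, {"PodAffinity"})}
active_q = []
on_cluster_event(pool, active_q, {"resource": "Pod", "action": "Update", "new_labels": {"app=web"}})
print(active_q)  # []
on_cluster_event(pool, active_q, {"resource": "Pod", "action": "Update", "new_labels": {"app=db"}})
print(active_q)  # ['pod-a']
```

Only the hints of the plugins that rejected a Pod are consulted, so an unrelated event (here, a label that does not match) leaves PodA in the pool and costs no Scheduling Cycle.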

## What's new in v1.28

We have been working on the development of QueueingHint since v1.27.
In v1.27, only one alpha plugin (DRA) supported QueueingHint,
and in v1.28, some stable plugins started to work with QueueingHint.

QueueingHint is not user-facing, but we have a feature gate () as a safety net
because QueueingHint significantly changes a critical path of the scheduler.

## Getting involved

These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).

Please join us and share your feedback.

## How can I learn more?

- [KEP-4247: Per-plugin callback functions for efficient requeueing in the scheduling queue](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4247-queueinghint/README.md)