diff --git a/content/en/docs/concepts/scheduling-eviction/_index.md b/content/en/docs/concepts/scheduling-eviction/_index.md
index 00330b5a4b3eb..91d77b01ec9c9 100644
--- a/content/en/docs/concepts/scheduling-eviction/_index.md
+++ b/content/en/docs/concepts/scheduling-eviction/_index.md
@@ -28,6 +28,7 @@ of terminating one or more Pods on Nodes.
 * [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
 * [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
 * [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
+* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
 
 ## Pod Disruption
 
diff --git a/content/en/docs/concepts/scheduling-eviction/pod-scheduling-readiness.md b/content/en/docs/concepts/scheduling-eviction/pod-scheduling-readiness.md
new file mode 100644
index 0000000000000..b8a9dd69e19b0
--- /dev/null
+++ b/content/en/docs/concepts/scheduling-eviction/pod-scheduling-readiness.md
@@ -0,0 +1,110 @@
+---
+title: Pod Scheduling Readiness
+content_type: concept
+weight: 40
+---
+
+
+
+{{< feature-state for_k8s_version="v1.26" state="alpha" >}}
+
+Pods were considered ready for scheduling once created. The Kubernetes scheduler
+does its due diligence to find nodes to place all pending Pods. However, in a
+real-world case, some Pods may stay in a "missing-essential-resources" state for a long period.
+These Pods churn the scheduler (and downstream integrators like Cluster Autoscaler)
+unnecessarily.
+
+By specifying or removing a Pod's `.spec.schedulingGates`, you can control when the Pod is ready
+to be considered for scheduling.
+
+
+
+## Configuring Pod schedulingGates
+
+The `schedulingGates` field contains a list of gates, each identified by a `name` string, and each gate is treated as a
+criterion that the Pod must satisfy before it is considered schedulable.
+This field can be initialized only when a Pod is created (either by the client, or mutated during admission).
+After creation, each scheduling gate can be removed in arbitrary order, but adding a new scheduling gate is disallowed.
+
+{{< mermaid >}}
+stateDiagram-v2
+    s1: pod created
+    s2: pod scheduling gated
+    s3: pod scheduling ready
+    s4: pod running
+    if: empty scheduling gates?
+    state if <<choice>>
+    [*] --> s1
+    s1 --> if
+    s2 --> if: scheduling gate removed
+    if --> s2: no
+    if --> s3: yes
+    s3 --> s4
+    s4 --> [*]
+{{< /mermaid >}}
+
+## Usage example
+
+To mark a Pod not-ready for scheduling, you can create it with one or more scheduling gates like this:
+
+{{< codenew file="pods/pod-with-scheduling-gates.yaml" >}}
+
+After the Pod's creation, you can check its state using:
+
+```bash
+kubectl get pod test-pod
+```
+
+The output shows that it is in the `SchedulingGated` state:
+
+```none
+NAME       READY   STATUS            RESTARTS   AGE
+test-pod   0/1     SchedulingGated   0          7s
+```
+
+You can also check its `schedulingGates` field by running:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
+```
+
+The output is:
+
+```none
+[{"name":"foo"},{"name":"bar"}]
+```
+
+To inform the scheduler that this Pod is ready for scheduling, you can remove its `schedulingGates` entirely
+by re-applying a modified manifest:
+
+{{< codenew file="pods/pod-without-scheduling-gates.yaml" >}}
+
+You can check whether the `schedulingGates` field is cleared by running:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
+```
+
+The output is expected to be empty.
+You can then check its latest status by running:
+
+```bash
+kubectl get pod test-pod -o wide
+```
+
+Given that test-pod doesn't request any CPU or memory resources, its state is expected to transition from `SchedulingGated` to `Running`:
+
+```none
+NAME       READY   STATUS    RESTARTS   AGE   IP         NODE
+test-pod   1/1     Running   0          15s   10.0.0.4   node-2
+```
+
+## Observability
+
+The metric `scheduler_pending_pods` comes with a new value `"gated"` for its `queue` label, to distinguish whether scheduling of a Pod
+has been attempted and the Pod found unschedulable, or the Pod has been explicitly marked as not ready for
+scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the metric result.
+
+## {{% heading "whatsnext" %}}
+
+* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates.md b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
index caa1f9a0e3a6e..5531a63ffa498 100644
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
@@ -152,6 +152,7 @@ For a reference to old feature gates that are removed, please refer to
 | `PodDeletionCost` | `true` | Beta | 1.22 | |
 | `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
 | `PodHasNetworkCondition` | `false` | Alpha | 1.25 | |
+| `PodSchedulingReadiness` | `false` | Alpha | 1.26 | |
 | `ProbeTerminationGracePeriod` | `false` | Alpha | 1.21 | 1.21 |
 | `ProbeTerminationGracePeriod` | `false` | Beta | 1.22 | 1.24 |
 | `ProbeTerminationGracePeriod` | `true` | Beta | 1.25 | |
@@ -652,6 +653,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
   pod stats from the CRI container runtime rather than gathering them from cAdvisor.
 - `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
 - `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
+- `PodSchedulingReadiness`: Enable setting the `schedulingGates` field to control a Pod's [scheduling readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness).
 - `PodSecurity`: Enables the `PodSecurity` admission plugin.
 - `PreferNominatedNode`: This flag tells the scheduler whether the nominated
   nodes will be checked first before looping through all the other nodes in
diff --git a/content/en/examples/pods/pod-with-scheduling-gates.yaml b/content/en/examples/pods/pod-with-scheduling-gates.yaml
new file mode 100644
index 0000000000000..b0b012fb72ca8
--- /dev/null
+++ b/content/en/examples/pods/pod-with-scheduling-gates.yaml
@@ -0,0 +1,11 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  schedulingGates:
+  - name: foo
+  - name: bar
+  containers:
+  - name: pause
+    image: registry.k8s.io/pause:3.6
diff --git a/content/en/examples/pods/pod-without-scheduling-gates.yaml b/content/en/examples/pods/pod-without-scheduling-gates.yaml
new file mode 100644
index 0000000000000..5638b6e97af5f
--- /dev/null
+++ b/content/en/examples/pods/pod-without-scheduling-gates.yaml
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  containers:
+  - name: pause
+    image: registry.k8s.io/pause:3.6
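
The gating rules that the new concept page describes (gates can be set only at creation, can be removed in any order, and the Pod becomes ready for scheduling once the list is empty) can be sketched as a small model. This is an illustrative Python sketch for reasoning about the state diagram, not Kubernetes code: `GatedPod`, `remove_gate`, `add_gate`, `is_scheduling_ready`, and `state` are hypothetical names, and the state strings are taken from the diagram labels rather than from any API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GatedPod:
    """Toy model of a Pod's scheduling gates (hypothetical, not a Kubernetes API)."""

    name: str
    # Gates are supplied only at construction, mirroring "set at creation only".
    scheduling_gates: List[str] = field(default_factory=list)

    def remove_gate(self, gate: str) -> None:
        """Remove one gate by name; the removal order does not matter."""
        if gate not in self.scheduling_gates:
            raise ValueError(f"gate {gate!r} is not set on pod {self.name!r}")
        self.scheduling_gates.remove(gate)

    def add_gate(self, gate: str) -> None:
        """Adding a gate after creation is disallowed, so this always raises."""
        raise PermissionError("scheduling gates cannot be added after creation")

    def is_scheduling_ready(self) -> bool:
        """A Pod is ready to be considered for scheduling once no gates remain."""
        return not self.scheduling_gates

    def state(self) -> str:
        """Return the state-diagram label that matches the current gate list."""
        if self.is_scheduling_ready():
            return "scheduling ready"
        return "scheduling gated"


# Mirror the usage example: create with gates foo and bar, then remove both.
pod = GatedPod("test-pod", ["foo", "bar"])
print(pod.state())
pod.remove_gate("bar")  # removal in arbitrary order
pod.remove_gate("foo")
print(pod.state())
```

The model deliberately makes `add_gate` always fail, since the page states that adding a gate after creation is disallowed; whether a real cluster reports that as a validation error on update is outside what this sketch covers.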