From ea571068c4ef05fbb0e58deb96c5214be5446f27 Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Tue, 24 Sep 2024 08:48:15 -0700 Subject: [PATCH 01/20] Initial draft of summary, motivation, goals, non-goals --- keps/prod-readiness/sig-node/4816.yaml | 3 + .../4816-dra-prioritized-list/README.md | 842 ++++++++++++++++++ .../4816-dra-prioritized-list/kep.yaml | 51 ++ 3 files changed, 896 insertions(+) create mode 100644 keps/prod-readiness/sig-node/4816.yaml create mode 100644 keps/sig-node/4816-dra-prioritized-list/README.md create mode 100644 keps/sig-node/4816-dra-prioritized-list/kep.yaml diff --git a/keps/prod-readiness/sig-node/4816.yaml b/keps/prod-readiness/sig-node/4816.yaml new file mode 100644 index 00000000000..f891036d62c --- /dev/null +++ b/keps/prod-readiness/sig-node/4816.yaml @@ -0,0 +1,3 @@ +kep-number: 4816 +alpha: + approver: "@jpbetz" diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md new file mode 100644 index 00000000000..088a07f9ff0 --- /dev/null +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -0,0 +1,842 @@ + +# [KEP-4816](https://github.com/kubernetes/enhancements/issues/4816): DRA: Prioritized Alternatives in Device Requests + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade 
Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in 
[kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + + +The [DRA Structured +Parameters](https://git.k8s.io/enhancements/keps/sig-node/4381-dra-structured-parameters) +feature has added the ability to make requests for very specific types of +devices using a `ResourceClaim`. However, the current API does not allow the +user to indicate any priority when multiple types or configurations of devices +may meet the needs of the workload. This feature allows the user to specify +alternative requests that satisfy the workload's need, giving the scheduler more +flexibility in scheduling the workload. + +## Motivation + + + +"Obtainability" of certain types of scarce resources is a primary concern of +many AI/ML users. GPUs are in high demand, particularly the latest models. This +means that workloads that use DRA to specify a need for particular types of GPUs +may fail to schedule. In practice, a workload that needs a GPU can be written +such that it can discover the GPUs available to it, and work with what it is +given. A user may have a preference for the latest model, but would like to run +the workload even if only an older model is available. + +Similarly, packaged workload authors may wish to configure a workload such that +it will work well in the widest selection of available clusters. That is, a +distributor of shared workload definitions would like to be able to specify +alternative types of devices with which their workload will function, without +requiring the user to modify the manifests. 
+ +### Goals + + + +* Allow workload authors, when specifying a `ResourceClaim`, to provide a list + of ways to satisfy the claim, with a preference ranking. +* Enable the scheduler to evaluate those preferences and allocate devices for the + claim based on them. +* Enable the cluster autoscaler to evaluate those preferences and make scaling + choices based on them. + +### Non-Goals + +* Enable cross-claim consistency of request choices. For example, guaranteeing + that all `ResourceClaim`s associated with a given `Deployment` are satisfied + using the same choice from the list of possible alternatives. + + + +## Proposal + + + +### User Stories (Optional) + + + +#### Story 1 + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? 
+ + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+ + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/4816-dra-prioritized-list/kep.yaml b/keps/sig-node/4816-dra-prioritized-list/kep.yaml new file mode 100644 index 00000000000..cead6a0c478 --- /dev/null +++ b/keps/sig-node/4816-dra-prioritized-list/kep.yaml @@ -0,0 +1,51 @@ +title: DRA Prioritized List +kep-number: 4816 +authors: + - "@johnbelamaric" +owning-sig: sig-node +participating-sigs: + - sig-scheduling + - sig-autoscaling +status: provisional +creation-date: 2024-09-24 +reviewers: + - "@pohly" + - "@klueska" + - "@thockin" +approvers: + - "@mrunalp" # SIG-Node + - "@alculquicondor" # SIG-Scheduling + - "@MaciekPytel" # SIG-Autoscaling + - "@thockin" # API Review + +see-also: + - "/keps/sig-node/4381-dra-structured-parameters" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.32" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.32" + beta: "v1.33" + stable: "v1.34" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DRAPrioritizedList + components: + - kube-apiserver + - kube-controller-manager + - kube-scheduler + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + #- my_feature_metric From 76a1f298b7943c0d41b2a43cd3deedc0909165b1 Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Thu, 26 Sep 2024 11:08:01 -0700 Subject: [PATCH 02/20] Review feedback, possible API --- .../4816-dra-prioritized-list/README.md | 109 +++++++++++++++++- 1 file changed, 103 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 088a07f9ff0..02d14c72171 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -107,6 +107,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) + - [Resource Claim Indirection](#resource-claim-indirection) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -217,22 +218,24 @@ know that this has succeeded? * Allow workload authors, when specifying a `ResourceClaim`, to provide a list of ways to satisfy the claim, with a preference ranking. -* Enable the scheduler to evaluate those preferences and allocate devices for the +* Enable schedulers to evaluate those preferences and allocate devices for the claim based on them. -* Enable the cluster autoscaler to evaluate those preferences and make scaling +* Enable cluster autoscalers to evaluate those preferences and make scaling choices based on them. +* Provide some measure of ResourceQuota controls when users utilize claims with + these types of requests. 
### Non-Goals -* Enable cross-claim consistency of request choices. For example, guaranteeing - that all `ResourceClaim`s associated with a given `Deployment` are satisfied - using the same choice from the list of possible alternatives. - +* Enable cross-claim consistency of request choices. For example, guaranteeing + that all `ResourceClaim`s associated with a given `Deployment` are satisfied + using the same choice from the list of possible alternatives. + ## Proposal +The proposal adds a new type, called `RankedDeviceRequest`, which allows the +user to list `DeviceRequest`s, exactly one of which must be satisfied. The +`DeviceClaim` then gets a new field listing all of these such requests that must +be satisfied. There is no change to the existing `DeviceRequest` type. + +```go +// DeviceClaim defines how to request devices with a ResourceClaim. +type DeviceClaim struct { + // Requests represent individual requests for distinct devices which + // must all be satisfied. If empty, nothing needs to be allocated. + // + // +optional + // +listType=atomic + Requests []DeviceRequest + + // RankedRequests represents groups of requests, where exactly one + // request in each group must be satisfied. + // + // +optional + // +listType=atomic + RankedRequests []RankedDeviceRequest + + // These constraints must be satisfied by the set of devices that get + // allocated for the claim. + // + // +optional + // +listType=atomic + Constraints []DeviceConstraint + + // This field holds configuration for multiple potential drivers which + // could satisfy requests in this claim. It is ignored while allocating + // the claim. + // + // +optional + // +listType=atomic + Config []DeviceClaimConfiguration + + // Potential future extension, ignored by older schedulers. This is + // fine because scoring allows users to define a preference, without + // making it a hard requirement. 
+ // + // Score *SomeScoringStruct +} + +const ( + DeviceRequestsMaxSize = AllocationResultsMaxSize + DeviceConstraintsMaxSize = 32 + DeviceConfigMaxSize = 32 +) + +// RankedDeviceRequest is a list of DeviceRequests, in the user's order of +// preference for allocation. +// +type RankedDeviceRequest struct { + // Name can be used to reference this request in a pod.spec.containers[].resources.claims + // entry, or in Constraints or Config. + // + // In the container spec, this is the name that must be used, rather + // the names of the underlying requests. + // + // In the Contraints or Config, this name may be used, or the underlying request + // names may be used to provide additional specificity. + // + // Must be a DNS label. + // + // +required + Name string + + + // Requests represent individual requests for distinct devices, exactly + // one of which must be satisfied. If empty, nothing needs to be allocated. + // + // +optional + // +listType=atomic + Requests []DeviceRequest +} + +const ( + RankedDeviceRequestsMaxSize = 8 +) +``` + +ResourceQuota will be enforced such that the user must have quota for each +`DeviceRequest` under every `RankedDeviceRequest`. Thus, this "pick one" +behavior cannot be used to circumvent quota. This reduces the usefuleness of the +feature, as it no longer services as a quota-management feature. However, the +primary goal of the feature is about flexibility across clusters and +obtainability of underlying devices, not quota management. + ### User Stories (Optional) +### Resource Claim Indirection + +Rather than embedding a list of alternative request objects, we could use an +umbrella `ResourceClaim` that instead references other `ResourceClaim`s. + ## Infrastructure Needed (Optional) -The proposal adds a new type, called `RankedDeviceRequest`, which allows the -user to list `DeviceRequest`s, exactly one of which must be satisfied. 
The -`DeviceClaim` then gets a new field listing all of these such requests that must -be satisfied. There is no change to the existing `DeviceRequest` type. +The proposal adds a new field to the `DeviceRequest`, called `FirstOf` which +will contain an ordered list of `DeviceRequest` objects. In order to satisfy the +main (containing) request, exactly one of the requests listed in `FirstOf` must +be satisfied. They order listed is considered a priority order, such that the +scheduler will only try to use the second item in the list if it is unable to +satsify the first item, and so on. + +A `DeviceRequest` that populates the `FirstOf` field must *not* populate the +`DeviceClassName` field. The `required` validation on this field will be +relaxed. This allows existing clients to differentiate between claims they +understand (with `DeviceClassName`) and those they do not (without +`DeviceClassName` but with the new field). Clients written for 1.31, when +`DeviceClassName` was required, were requested to include this logic, and the +in-tree components have been built in this way. ```go -// DeviceClaim defines how to request devices with a ResourceClaim. -type DeviceClaim struct { - // Requests represent individual requests for distinct devices which - // must all be satisfied. If empty, nothing needs to be allocated. +// DeviceRequest is a request for devices required for a claim. +// This is typically a request for a single resource like a device, but can +// also ask for several identical devices. +type DeviceRequest struct { + // Name can be used to reference this request in a pod.spec.containers[].resources.claims + // entry and in a constraint of the claim. // - // +optional - // +listType=atomic - Requests []DeviceRequest + // Must be a DNS label. + // + // +required + Name string - // RankedRequests represents groups of requests, where exactly one - // request in each group must be satisfied. 
All entries in this list - // must be satisfied, using exactly one of the DeviceRequests listed - // in each RankedDeviceRequest. + // DeviceClassName references a specific DeviceClass, which can define + // additional configuration and selectors to be inherited by this + // request. + // + // Either a class or FirstOf requests are required in DeviceClaim.Requests. + // When this request is part of the FirstOf list, a class is required. Nested + // FirstOf requests are not allowed + // + // Which classes are available depends on the cluster. + // + // Administrators may use this to restrict which devices may get + // requested by only installing classes with selectors for permitted + // devices. If users are free to request anything without restrictions, + // then administrators can create an empty DeviceClass for users + // to reference. // // +optional - // +listType=atomic - RankedRequests []RankedDeviceRequest + // +oneOf=deviceRequestType + DeviceClassName string - // These constraints must be satisfied by the set of devices that get - // allocated for the claim. + // FirstOf contains subrequests, exactly one of which must be satisfied + // in order to satisfy this request. This field may only be set in the + // entries of DeviceClaim.Requests. It must not be set in DeviceRequest + // instances that themselves are part of a FirstOf. // // +optional - // +listType=atomic - Constraints []DeviceConstraint + // +oneOf=deviceRequestType + FirstOf []DeviceRequest - // This field holds configuration for multiple potential drivers which - // could satisfy requests in this claim. It is ignored while allocating - // the claim. + // Selectors define criteria which must be satisfied by a specific + // device in order for that device to be considered for this + // request. All selectors must be satisfied for a device to be + // considered. 
// // +optional // +listType=atomic - Config []DeviceClaimConfiguration + Selectors []DeviceSelector - // Potential future extension, ignored by older schedulers. This is - // fine because scoring allows users to define a preference, without - // making it a hard requirement. + // AllocationMode and its related fields define how devices are allocated + // to satisfy this request. Supported values are: // - // Score *SomeScoringStruct -} - -const ( - DeviceRequestsMaxSize = AllocationResultsMaxSize - DeviceConstraintsMaxSize = 32 - DeviceConfigMaxSize = 32 -) - -// RankedDeviceRequest is a list of DeviceRequests, in the user's order of -// preference for allocation. -// -type RankedDeviceRequest struct { - // Name can be used to reference this request in a pod.spec.containers[].resources.claims - // entry, or in Constraints or Config. + // - ExactCount: This request is for a specific number of devices. + // This is the default. The exact number is provided in the + // count field. // - // In the pod spec, this is the name that must be used, rather - // the names of the underlying requests. + // - All: This request is for all of the matching devices in a pool. + // Allocation will fail if some devices are already allocated, + // unless adminAccess is requested. // - // In the Contraints or Config, this name may be used, or the underlying request - // names may be used to provide additional specificity. + // If AllocationMode is not specified, the default mode is ExactCount. If + // the mode is ExactCount and count is not specified, the default count is + // one. Any other requests must specify this field. // - // Must be a DNS label. + // More modes may get added in the future. Clients must refuse to handle + // requests with unknown modes. // - // +required - Name string - + // +optional + AllocationMode DeviceAllocationMode - // Requests represent individual requests for distinct devices, exactly - // one of which must be satisfied. 
If empty, nothing needs to be allocated. + // Count is used only when the count mode is "ExactCount". Must be greater than zero. + // If AllocationMode is ExactCount and this field is not specified, the default is one. // // +optional - // +listType=atomic - Requests []DeviceRequest + // +oneOf=AllocationMode + Count int64 + + // AdminAccess indicates that this is a claim for administrative access + // to the device(s). Claims with AdminAccess are expected to be used for + // monitoring or other management services for a device. They ignore + // all ordinary claims to the device with respect to access modes and + // any resource allocations. + // + // +optional + // +default=false + AdminAccess bool } const ( - RankedDeviceRequestsMaxSize = 8 + DeviceSelectorsMaxSize = 32 + FirstOfDeviceRequestMaxSize = 8 ) ``` @@ -343,9 +370,8 @@ spec: requests: - name: nic deviceClassName: rdma-nic - rankedRequests: - name: gpu - requests: + firstOf: - name: big-gpu deviceClassName: big-gpu - name: mid-gpu @@ -390,22 +416,22 @@ spec: request: "gpu" # the 'nic' request is pod-level, no need to attach to container ``` -There are a few things to note here. First, the "nic" request is listed in -`requests`, because it has no alternative request types. The "gpu" request could -be met by serveral different types of GPU, in an order of preference. Each of -those is a separate `DeviceRequest`, and thus also has its own name. This allow -us to apply constraints or configuration to specific, individual requests, in -the event even that it is the chosen alternative. In this example, the -"small-gpu" choice requires a configuration option that the other two choices do -not need. Thus, if the resolution of the "gpu" request is made using the -"small-gpu" subrequest, then that configuration will be attached to the -allocation. Otherwise, it will not. 
- -Similarly, for `Constraints`, the list of requests can include the ranked -request name ("gpu" in this case), in which case the constraint applies -regardless of which alternative is chosen. Or, it can include the subrequest -name, in which case that constraint only applies if that particular subrequest -is chosen. +There are a few things to note here. First, the "nic" request is listed with a +`deviceClassName`, because it has no alternative request types. The "gpu" +request could be met by several different types of GPU, in the listed order of +preference. Each of those is a separate `DeviceRequest`, with both a +`deviceClassName` and also its own name. The fact that these subrequests also +have their own names allows us to apply constraints or configuration to +specific, individual subrequests, in the event that it is the chosen +alternative. In this example, the "small-gpu" choice requires a configuration +option that the other two choices do not need. Thus, if the resolution of the +"gpu" request is made using the "small-gpu" subrequest, then that configuration +will be attached to the allocation. Otherwise, it will not. + +Similarly, for `Constraints`, the list of requests can include the main request +name ("gpu" in this case), in which case the constraint applies regardless of +which alternative is chosen. Or, it can include the subrequest name, in which +case that constraint only applies if that particular subrequest is chosen. In the PodSpec, however, the subrequest names are not valid. Only the main request name may be used. @@ -413,11 +439,11 @@ request name may be used. ### Resource Quota ResourceQuota will be enforced such that the user must have quota for each -`DeviceRequest` under every `RankedDeviceRequest`. Thus, this "pick one" -behavior cannot be used to circumvent quota. This reduces the usefuleness of the -feature, as it no longer services as a quota-management feature. 
However, the -primary goal of the feature is about flexibility across clusters and -obtainability of underlying devices, not quota management. +`DeviceRequest` under every `FirstOf`. Thus, this "pick one" behavior cannot be +used to circumvent quota. This reduces the usefulness of the feature, as it +means it will not serve as a quota management feature. However, the primary goal +of the feature is about flexibility across clusters and obtainability of +underlying devices, not quota management. ### User Stories (Optional) From 338dde9842ed1a200bc96ca186d4d91a1440eb34 Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Fri, 27 Sep 2024 11:49:14 -0700 Subject: [PATCH 06/20] typo --- keps/sig-node/4816-dra-prioritized-list/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 3c27d1fe016..62c40e1adac 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -250,7 +250,7 @@ nitty-gritty. The proposal adds a new field to the `DeviceRequest`, called `FirstOf` which will contain an ordered list of `DeviceRequest` objects. In order to satisfy the main (containing) request, exactly one of the requests listed in `FirstOf` must -be satisfied. They order listed is considered a priority order, such that the +be satisfied. The order listed is considered a priority order, such that the scheduler will only try to use the second item in the list if it is unable to satsify the first item, and so on. 
From 010e7e193477f26f77c27ccba05f79d0d016b2fe Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Mon, 7 Oct 2024 12:52:56 -0700 Subject: [PATCH 07/20] Fill in the rest of the KEP --- .../4816-dra-prioritized-list/README.md | 290 +++++++++++------- 1 file changed, 179 insertions(+), 111 deletions(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 62c40e1adac..ecb24de9081 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -18,11 +18,11 @@ To get started with this template: - [x] **Fill out as much of the kep.yaml file as you can.** At minimum, you should fill in the "Title", "Authors", "Owning-sig", "Status", and date-related fields. -- [ ] **Fill out this file as best you can.** +- [x] **Fill out this file as best you can.** At minimum, you should fill in the "Summary" and "Motivation" sections. These should be easy if you've preflighted the idea of the KEP with the appropriate SIG(s). -- [ ] **Create a PR for this KEP.** +- [x] **Create a PR for this KEP.** Assign it to people in the SIG who are sponsoring this process. - [ ] **Merge early and iterate.** Avoid getting hung up on specific details and instead aim to get the goals of @@ -129,16 +129,16 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) - [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place +- [x] (R) Graduation criteria is in place - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) -- [ ] (R) Production readiness review completed +- [x] (R) Production readiness review completed - [ ] (R) Production readiness review approved - [ ] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] @@ -247,17 +247,17 @@ The "Design Details" section below is for the real nitty-gritty. --> -The proposal adds a new field to the `DeviceRequest`, called `FirstOf` which -will contain an ordered list of `DeviceRequest` objects. In order to satisfy the -main (containing) request, exactly one of the requests listed in `FirstOf` must -be satisfied. 
The order listed is considered a priority order, such that the -scheduler will only try to use the second item in the list if it is unable to -satsify the first item, and so on. - -A `DeviceRequest` that populates the `FirstOf` field must *not* populate the -`DeviceClassName` field. The `required` validation on this field will be -relaxed. This allows existing clients to differentiate between claims they -understand (with `DeviceClassName`) and those they do not (without +The proposal adds a new field to the `DeviceRequest`, called `FirstAvailableOf` +which will contain an ordered list of `DeviceRequest` objects. In order to +satisfy the main (containing) request, exactly one of the requests listed in +`FirstAvailableOf` must be satisfied. The order listed is considered a priority +order, such that the scheduler will only try to use the second item in the list +if it is unable to satsify the first item, and so on. + +A `DeviceRequest` that populates the `FirstAvailableOf` field must *not* +populate the `DeviceClassName` field. The `required` validation on this field +will be relaxed. This allows existing clients to differentiate between claims +they understand (with `DeviceClassName`) and those they do not (without `DeviceClassName` but with the new field). Clients written for 1.31, when `DeviceClassName` was required, were requested to include this logic, and the in-tree components have been built in this way. @@ -279,9 +279,9 @@ type DeviceRequest struct { // additional configuration and selectors to be inherited by this // request. // - // Either a class or FirstOf requests are required in DeviceClaim.Requests. - // When this request is part of the FirstOf list, a class is required. Nested - // FirstOf requests are not allowed + // Either a class or FirstAvailableOf requests are required in DeviceClaim.Requests. + // When this request is part of the FirstAvailableOf list, a class is required. 
Nested + // FirstAvailableOf requests are not allowed // // Which classes are available depends on the cluster. // @@ -295,14 +295,14 @@ type DeviceRequest struct { // +oneOf=deviceRequestType DeviceClassName string - // FirstOf contains subrequests, exactly one of which must be satisfied + // FirstAvailableOf contains subrequests, exactly one of which must be satisfied // in order to satisfy this request. This field may only be set in the // entries of DeviceClaim.Requests. It must not be set in DeviceRequest - // instances that themselves are part of a FirstOf. + // instances that themselves are part of a FirstAvailableOf. // // +optional // +oneOf=deviceRequestType - FirstOf []DeviceRequest + FirstAvailableOf []DeviceRequest // Selectors define criteria which must be satisfied by a specific // device in order for that device to be considered for this @@ -353,8 +353,8 @@ type DeviceRequest struct { } const ( - DeviceSelectorsMaxSize = 32 - FirstOfDeviceRequestMaxSize = 8 + DeviceSelectorsMaxSize = 32 + FirstAvailableOfDeviceRequestMaxSize = 8 ) ``` @@ -439,11 +439,11 @@ request name may be used. ### Resource Quota ResourceQuota will be enforced such that the user must have quota for each -`DeviceRequest` under every `FirstOf`. Thus, this "pick one" behavior cannot be -used to circumvent quota. This reduces the usefulness of the feature, as it -means it will not serve as a quota management feature. However, the primary goal -of the feature is about flexibility across clusters and obtainability of -underlying devices, not quota management. +`DeviceRequest` under every `FirstAvailableOf`. Thus, this "pick one" behavior +cannot be used to circumvent quota. This reduces the usefulness of the feature, +as it means it will not serve as a quota management feature. However, the +primary goal of the feature is about flexibility across clusters and +obtainability of underlying devices, not quota management. 
### User Stories (Optional) @@ -503,7 +503,7 @@ when drafting this test plan. [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md --> -[ ] I/we understand the owners of the involved components may require updates to +[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. @@ -535,7 +535,9 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. --> -- ``: `` - `` +- `k8s.io/kubernetes/pkg/scheduler`: TBD +- `k8s.io/kubernetes/pkg/scheduler/framework`: TBD +- `k8s.io/kubernetes/pkg/controller`: TBD ##### Integration tests @@ -554,7 +556,14 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -- : +The existing [integration tests for kube-scheduler which measure +performance](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf#readme) +will be extended to cover the overhead of running the additional logic to +support the features in this KEP. These also serve as [correctness +tests](https://github.com/kubernetes/kubernetes/commit/cecebe8ea2feee856bc7a62f4c16711ee8a5f5d9) +as part of the normal Kubernetes "integration" jobs which cover [the dynamic +resource +controller](https://github.com/kubernetes/kubernetes/blob/294bde0079a0d56099cf8b8cf558e3ae7230de12/test/integration/scheduler_perf/util.go#L135-L139). ##### e2e tests @@ -568,37 +577,15 @@ https://storage.googleapis.com/k8s-triage/index.html We expect no non-infra related flakes in the last month as a GA graduation criteria. --> -- : +End-to-end testing depends on a working resource driver and a container runtime +with CDI support.
A [test +driver](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/dra/test-driver) +was developed as part of the overall DRA development effort. We will extend this +test driver to enable support for alternative device requests and add tests to +ensure they are handled by the scheduler as described in this KEP. ### Graduation Criteria - - ### Upgrade / Downgrade Strategy - +Standard upgrade/downgrade strategies may be used, no special configuration +changes are needed. There are no kubelet or DRA-driver changes for this feature, +they are all local to the control plane. ### Version Skew Strategy - +The proposed API change relaxes a `required` constraint on the +`DeviceRequest.DeviceClassName` field. The `DeviceRequest` thus becomes a one-of +that must have either the `DeviceClassName` or the `FirstAvailableOf` field +populated. + +Older clients have been advised in the current implementation to check this +field, even though it is required, and fail to allocate a claim that does not +have the field set. This means that during rollout, if the API server has this +feature, but the scheduler does not, the scheduler will fail to schedule pods +that utilize the feature. The pod will be scheduled later according to the new +functionality after the scheduler is upgraded. + +This feature affects the specific allocations that get made by the scheduler. +Those allocations are stored in the `ResourceClaim` status, and will be acted +upon by the kubelet and DRA-driver just as if the user had made the request +without this feature. Thus, there is no impact on the data plane version skew; +if the selected request could be satisfied by the data plane without this +feature, it will work exactly the same with this feature. ## Production Readiness Review Questionnaire @@ -705,9 +670,15 @@ well as the [existing list] of feature gates. 
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ --> -- [ ] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: +This is an add-on on top of the `DynamicResourceAllocation` feature gate, which +also must be enabled for this feature to work. + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: DRAFirstAvailableOf - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler + - kube-controller-manager - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control @@ -722,6 +693,8 @@ Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> +No. + ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes. No existing claims or running pods will be affected. This feature affects +only the allocation of devices during scheduling. + +If a workload controller or Pod uses a `ResourceClaimTemplate` that includes +this feature, it could happen that a new Pod may be created and need to be +scheduled, even though the feature is disabled. In this case, the new Pod will +fail to schedule, as the corresponding `ResourceClaim` will not be able to be +created. + ###### What happens if we reenable the feature if it was previously rolled back? +The feature will begin working again for future scheduling choices that make use +of it. For `Deployments` or other users of `ResourceClaimTemplate`, previously +failing Pod creations or scheduling may begin to succeed. + ###### Are there any tests for feature enablement/disablement? +Unit tests will be written to validate the enablement and disablement behavior, +as well as type conversions for the new field and relaxed validation. + ### Rollout, Upgrade and Rollback Planning +Will consider in the beta timeframe. + ###### What specific metrics should inform a rollback? 
+Will consider in the beta timeframe. + ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? +Will consider in the beta timeframe. + ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +No, though we do relax validation on one field to make it no longer a required +field. + ### Monitoring Requirements +Will consider in the beta timeframe. + ###### How can someone using this feature know that it is working for their instance? +Will consider in the beta timeframe. + - [ ] Events - Event Reason: - [ ] API .status @@ -844,12 +846,16 @@ These goals will help you determine what you need to measure (SLIs) in the next question. --> +Existing DRA and related SLOs continue to apply. + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +Will consider in the beta timeframe. + - [ ] Metrics - Metric name: - [Optional] Aggregation method: @@ -864,6 +870,9 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co implementation difficulties, etc.). --> +We can consider a histogram metric showing how many allocations are made from +indices 0-7 of ResourceClaims that utilize this feature. + ### Dependencies +This feature depends on the DRA structured parameters feature being enabled, and +on DRA drivers being deployed. There are no requirements beyond those already +needed for DRA structured parameters. + ### Scalability +No. + ###### Will enabling / using this feature result in introducing new API types? +No, just a new field on the `ResourceClaim.DeviceRequest` struct. + ###### Will enabling / using this feature result in any new calls to the cloud provider? +No. + ###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+Yes, when using this field, the user will add additional data in their +`ResourceClaim` and `ResourceClaimTemplate` objects. This is an incremental +increase on top of the existing structures. The number of alternate requests is +limited to 8 in order to minimize the potential object size. + ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? +Scheduling a claim that uses this feature may take a bit longer, if it is +necessary to go deeper into the list of alternative options before finding a +suitable device. We can measure this impact in alpha. + ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +No. + ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? +No. + ### Troubleshooting -### Resource Claim Indirection +### Higher Level Indirection Rather than embedding a list of alternative request objects, we could use an -umbrella `ResourceClaim` that instead references other `ResourceClaim`s. +indirection at either the `ResourceClaim` level, or the `DeviceClaim` level. +For example, we could create a new resource claim type by adding a +`FirstOfDevices` list to the `ResourceClaimSpec`, and making it a one-of with +`Devices`. + +Something like this: + +```go +// ResourceClaimSpec defines what is being requested in a ResourceClaim and how to configure it. +type ResourceClaimSpec struct { + // Devices defines how to request devices. + // + // oneOf: claimType + // +optional + Devices DeviceClaim + + // FirstOfDevices defines devices to claim in a + // + // oneOf: claimType + // +optional + FirstOfDevices []DeviceClaim + + // + // Must be a DNS subdomain and should end with a DNS domain owned by the + // vendor of the driver. + // + // This is an alpha field and requires enabling the DRAControlPlaneController + // feature gate. 
+ // + // +optional + // +featureGate=DRAControlPlaneController + Controller string +} +``` + +This is arguably simpler and allows them to be essentially complete, alternate +claims. ## Infrastructure Needed (Optional) From 8471149917357557b6082e5ca0a004385b4f3c0f Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Mon, 7 Oct 2024 12:56:04 -0700 Subject: [PATCH 08/20] update field name in example --- keps/sig-node/4816-dra-prioritized-list/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index ecb24de9081..0b44bbcd8b6 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -371,7 +371,7 @@ spec: - name: nic deviceClassName: rdma-nic - name: gpu - firstOf: + firstAvailableOf: - name: big-gpu deviceClassName: big-gpu - name: mid-gpu From 33cdc03b073ac1376e524cbb0d568204bbba5ec9 Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Mon, 7 Oct 2024 13:37:56 -0700 Subject: [PATCH 09/20] Re-organize and add a bit --- .../4816-dra-prioritized-list/README.md | 150 +++++++++++------- 1 file changed, 97 insertions(+), 53 deletions(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 0b44bbcd8b6..370186deae1 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -247,6 +247,11 @@ The "Design Details" section below is for the real nitty-gritty. --> +The `ResourceClaim` object contains a `DeviceClaim`, which in turn contains a +list of `DeviceRequest` objects. This allows the user to allocate different +types of devices for the same claim, and apply constraints and configuration +across those different requests. 
+ The proposal adds a new field to the `DeviceRequest`, called `FirstAvailableOf` which will contain an ordered list of `DeviceRequest` objects. In order to satisfy the main (containing) request, exactly one of the requests listed in @@ -254,6 +259,79 @@ satisfy the main (containing) request, exactly one of the requests listed in order, such that the scheduler will only try to use the second item in the list if it is unable to satsify the first item, and so on. +This allows some flexibility for the user to create, say, a "gpu" request, but +allow it to be satisfied by one of several models of GPU. + +### User Stories (Optional) + + + +#### Story 1 + +As a workload author, I want to run a workload that needs a GPU. The workload +itself can work with a few different models of GPU, but may need different +numbers of them depending on the model chosen. If the latest model is available +in my cluster, I would like to use that, but if it is not I am willing to take +a model one generation older. If none of those are available, I am willing to +take two GPUs of an even older model. + +#### Story 2 + +As a workload author, I want to distribute the manifests of my workloads online. +However, there are many different models of device out there, and so I do not +want to be too prescriptive in how I define my manifest. If I make it too +detailed, then I will either need multiple versions or the users will have to +edit the manifest. Instead, I would like to provide some optionality in the +types of devices that can meet my workload's needs. For best performance though, +I do have a preferred ordering of devices. + +### Notes/Constraints/Caveats (Optional) + + + +#### Resource Quota + +ResourceQuota will be enforced such that the user must have quota for each +`DeviceRequest` under every `FirstAvailableOf`. Thus, this "pick one" behavior +cannot be used to circumvent quota. This reduces the usefulness of the feature, +as it means it will not serve as a quota management feature.
However, the +primary goal of the feature is about flexibility across clusters and +obtainability of underlying devices, not quota management. + + +### Risks and Mitigations + + + +## Design Details + + + A `DeviceRequest` that populates the `FirstAvailableOf` field must *not* populate the `DeviceClassName` field. The `required` validation on this field will be relaxed. This allows existing clients to differentiate between claims @@ -436,59 +514,25 @@ case that constraint only applies if that particular subrequest is chosen. In the PodSpec, however, the subrequest names are not valid. Only the main request name may be used. -### Resource Quota - -ResourceQuota will be enforced such that the user must have quota for each -`DeviceRequest` under every `FirstAvailableOf`. Thus, this "pick one" behavior -cannot be used to circumvent quota. This reduces the usefulness of the feature, -as it means it will not serve as a quota management feature. However, the -primary goal of the feature is about flexibility across clusters and -obtainability of underlying devices, not quota management. - -### User Stories (Optional) - - - -#### Story 1 - -#### Story 2 - -### Notes/Constraints/Caveats (Optional) - - - -### Risks and Mitigations - - - -## Design Details - - +### Scheduler Implementation + +Currently, the scheduler loops through each entry in `DeviceClaim.Requests` and +tries to satisfy each one. This would work essentially the same, except that +today, it [throws an +error](https://github.com/kubernetes/kubernetes/blob/03f134461462f86239067ec20ec17a0ba892db52/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L164) +when it encounters a claim with a missing `DeviceClassName`. Instead, here we +would check for entries in `FirstAvailableOf`, and add an additional loop, +trying each of these requests in order. + +The current implementation will navigate a depth-first search of the devices, +trying to satisfy all requests and constraints of all claims.
The optionality +offered at the `DeviceRequest` level provides another index state to track in +the +[`requestIndices`](https://github.com/kubernetes/kubernetes/blob/03f134461462f86239067ec20ec17a0ba892db52/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L362) and [`deviceIndices`](https://github.com/kubernetes/kubernetes/blob/03f134461462f86239067ec20ec17a0ba892db52/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L368). In the case of the feature gate +disabled, this new index will always be 0. + +Alternatively, we can refactor to make this code more defensible via a feature +gate. ### Test Plan From be36daa3831fc9019e7b39dbae063108ffb1235a Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Mon, 7 Oct 2024 13:40:57 -0700 Subject: [PATCH 10/20] update kep.yaml for latest changes --- keps/sig-node/4816-dra-prioritized-list/kep.yaml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/kep.yaml b/keps/sig-node/4816-dra-prioritized-list/kep.yaml index cead6a0c478..76dd2cbb9f5 100644 --- a/keps/sig-node/4816-dra-prioritized-list/kep.yaml +++ b/keps/sig-node/4816-dra-prioritized-list/kep.yaml @@ -38,12 +38,11 @@ milestone: # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - - name: DRAPrioritizedList + - name: DRAFirstAvailableOf components: - kube-apiserver - kube-controller-manager - kube-scheduler - - kubelet disable-supported: true # The following PRR answers are required at beta release From 3845eaffd5d9c0886713e30885206e717a0d901a Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Mon, 7 Oct 2024 18:24:36 -0700 Subject: [PATCH 11/20] Remove some unchanged API fields to make it easier to read --- .../4816-dra-prioritized-list/README.md | 67 +++---------------- 1 file changed, 10 insertions(+), 57 deletions(-) diff --git 
a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 370186deae1..60613bea0db 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -382,52 +382,7 @@ type DeviceRequest struct { // +oneOf=deviceRequestType FirstAvailableOf []DeviceRequest - // Selectors define criteria which must be satisfied by a specific - // device in order for that device to be considered for this - // request. All selectors must be satisfied for a device to be - // considered. - // - // +optional - // +listType=atomic - Selectors []DeviceSelector - - // AllocationMode and its related fields define how devices are allocated - // to satisfy this request. Supported values are: - // - // - ExactCount: This request is for a specific number of devices. - // This is the default. The exact number is provided in the - // count field. - // - // - All: This request is for all of the matching devices in a pool. - // Allocation will fail if some devices are already allocated, - // unless adminAccess is requested. - // - // If AlloctionMode is not specified, the default mode is ExactCount. If - // the mode is ExactCount and count is not specified, the default count is - // one. Any other requests must specify this field. - // - // More modes may get added in the future. Clients must refuse to handle - // requests with unknown modes. - // - // +optional - AllocationMode DeviceAllocationMode - - // Count is used only when the count mode is "ExactCount". Must be greater than zero. - // If AllocationMode is ExactCount and this field is not specified, the default is one. - // - // +optional - // +oneOf=AllocationMode - Count int64 - - // AdminAccess indicates that this is a claim for administrative access - // to the device(s). Claims with AdminAccess are expected to be used for - // monitoring or other management services for a device. 
They ignore - // all ordinary claims to the device with respect to access modes and - // any resource allocations. - // - // +optional - // +default=false - AdminAccess bool + ... } const ( @@ -1102,6 +1057,10 @@ Major milestones might include: Why should this KEP _not_ be implemented? --> +This adds complexity to the scheduler and to the cluster autoscaler, which will +simulate the satisfaction of claims with different node shapes. + + ## Alternatives @@ -257,7 +262,7 @@ which will contain an ordered list of `DeviceRequest` objects. In order to satisfy the main (containing) request, exactly one of the requests listed in `FirstAvailableOf` must be satisfied. The order listed is considered a priority order, such that the scheduler will only try to use the second item in the list -if it is unable to satsify the first item, and so on. +if it is unable to satisfy the first item, and so on. This allows some flexibility for the user to create, say, a "gpu" request, but allow it to be satisfied by one of several models of GPU. From a27146928507058f243d6bec0b1f398f658968ea Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Tue, 8 Oct 2024 09:54:49 -0700 Subject: [PATCH 13/20] Review feedback, implementable --- keps/sig-node/4816-dra-prioritized-list/README.md | 6 ++++++ keps/sig-node/4816-dra-prioritized-list/kep.yaml | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 90fd34832ee..22451f89896 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -721,6 +721,12 @@ scheduled, even though the feature is disabled. In this case, the new Pod will fail to schedule, as the corresponding `ResourceClaim` will not be able to be created. 
+The recommendation is to remove any usage of this feature in both +`ResourceClaim`s and `ResourceClaimTemplate`s when disabling the feature, and +force the workloads to use a specific device request instead. This will ensure +that there are no unexpected failures later, if a Pod gets rescheduled to +another node or recreated for some reason. + ###### What happens if we reenable the feature if it was previously rolled back? The feature will begin working again for future scheduling choices that make use diff --git a/keps/sig-node/4816-dra-prioritized-list/kep.yaml b/keps/sig-node/4816-dra-prioritized-list/kep.yaml index 76dd2cbb9f5..5fb4b7a3e10 100644 --- a/keps/sig-node/4816-dra-prioritized-list/kep.yaml +++ b/keps/sig-node/4816-dra-prioritized-list/kep.yaml @@ -6,7 +6,7 @@ owning-sig: sig-node participating-sigs: - sig-scheduling - sig-autoscaling -status: provisional +status: implementable creation-date: 2024-09-24 reviewers: - "@pohly" From 6cf172ec46fb0c765acb86005e91fd9138c74c4b Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Tue, 8 Oct 2024 17:11:00 -0700 Subject: [PATCH 14/20] Review feedback --- keps/sig-node/4816-dra-prioritized-list/README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/keps/sig-node/4816-dra-prioritized-list/README.md b/keps/sig-node/4816-dra-prioritized-list/README.md index 22451f89896..467db81001b 100644 --- a/keps/sig-node/4816-dra-prioritized-list/README.md +++ b/keps/sig-node/4816-dra-prioritized-list/README.md @@ -494,6 +494,14 @@ disabled, this new index will always be 0. Alternatively, we can refactor to make this code more defensible via a feature gate. +DRA today works on a "first match" basis for a given node. That would not change +with this KEP. However, in order for the scheduler to prefer a node that has the +initial prioritized device request, those requests would need a higher score. +This will be implemented for beta. 
For alpha, the scheduler may still pick a +node with a less preferred device, if there are nodes with each type of device +available. + + ### Test Plan +### User Stories #### Story 1 @@ -295,14 +288,7 @@ edit the manifest. Instead, I would like to provide some optionality in the types of devices that can meet my workload's needs. For best performance though, I do have a preferred ordering of devices. -### Notes/Constraints/Caveats (Optional) - - +### Notes/Constraints/Caveats #### Resource Quota @@ -501,7 +487,6 @@ This will be implemented for beta. For alpha, the scheduler may still pick a node with a less preferred device, if there are nodes with each type of device available. - ### Test Plan -- `k8s.io/kubernetes/pkg/scheduler`: TBD -- `k8s.io/kubernetes/pkg/scheduler/framework`: TBD -- `k8s.io/kubernetes/pkg/controller`: TBD + + +Start of v1.32 development cycle (v1.32.0-alpha.1-178-gd9c46d8ecb1): + +- `k8s.io/dynamic-resource-allocation/cel`: 88.8% +- `k8s.io/dynamic-resource-allocation/structured`: 82.7% +- `k8s.io/kubernetes/pkg/controller/resourceclaim`: 70.0% +- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: 72.9% ##### Integration tests From 51e4db71bcc8cfe1207a5ee80100cc426c709c26 Mon Sep 17 00:00:00 2001 From: John Belamaric Date: Wed, 9 Oct 2024 11:08:56 -0700 Subject: [PATCH 18/20] Review feedback --- .../sig-scheduling/4816-dra-prioritized-list/README.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/keps/sig-scheduling/4816-dra-prioritized-list/README.md b/keps/sig-scheduling/4816-dra-prioritized-list/README.md index 5df673b0cdc..d6a694c02e9 100644 --- a/keps/sig-scheduling/4816-dra-prioritized-list/README.md +++ b/keps/sig-scheduling/4816-dra-prioritized-list/README.md @@ -135,7 +135,7 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) KEP approvers have approved the KEP status as `implementable` - [x] (R) Design details are appropriately documented - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) @@ -144,8 +144,8 @@ Items marked with (R) are required *prior to targeting to a milestone / release* - [x] (R) Graduation criteria is in place - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [x] (R) Production readiness review completed -- [ ] (R) Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone +- [x] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes @@ -601,7 +601,7 @@ ensure they are handled by the scheduler as described in this KEP. - Gather feedback - Implement node scoring -- Cluster auto scaler implementation +- Evaluate feasibility of cluster auto scaler implementation - Additional tests are in Testgrid and linked in KEP #### GA @@ -1066,6 +1066,8 @@ Major milestones might include: - when the KEP was retired or superseded --> +1.32 Enhancements Freeze - KEP merged, alpha implementation initiated + ## Drawbacks