KEP-4816: DRA: Prioritized Alternatives in Device Requests

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The DRA Structured Parameters feature has added the ability to make requests for very specific types of devices using a ResourceClaim. However, the current API does not allow the user to indicate any priority when multiple types or configurations of devices may meet the needs of the workload. This feature allows the user to specify alternative requests that satisfy the workload's needs, giving the scheduler more flexibility in scheduling the workload.

Motivation

"Obtainability" of certain types of scarce resources is a primary concern of many AI/ML users. GPUs are in high demand, particularly the latest models. This means that workloads that use DRA to specify a need for particular types of GPUs may fail to schedule. In practice, a workload that needs a GPU can be written such that it can discover the GPUs available to it, and work with what it is given. A user may have a preference for the latest model, but would like to run the workload even if only an older model is available.

Similarly, packaged workload authors may wish to configure a workload such that it will work well in the widest selection of available clusters. That is, a distributor of shared workload definitions would like to be able to specify alternative types of devices with which their workload will function, without requiring the user to modify the manifests.

Goals

  • Allow workload authors, when specifying a ResourceClaim, to provide a list of ways to satisfy the claim, with a preference ranking.
  • Enable schedulers to evaluate those preferences and allocate devices for the claim based on them.
  • Enable cluster autoscalers to evaluate those preferences and make scaling choices based on them.
  • Provide some measure of ResourceQuota controls when users utilize claims with these types of requests.

Non-Goals

  • Enable cross-claim consistency of request choices. For example, guaranteeing that all ResourceClaims associated with a given Deployment are satisfied using the same choice from the list of possible alternatives.

Proposal

The ResourceClaim object contains a DeviceClaim, which in turn contains a list of DeviceRequest objects. This allows the user to allocate different types of devices for the same claim, and apply constraints and configuration across those different requests.

The proposal adds a new field to the DeviceRequest, called FirstAvailableOf which will contain an ordered list of DeviceRequest objects. In order to satisfy the main (containing) request, exactly one of the requests listed in FirstAvailableOf must be satisfied. The order listed is considered a priority order, such that the scheduler will only try to use the second item in the list if it is unable to satisfy the first item, and so on.

This allows some flexibility for the user to create, say, a "gpu" request, but allow it to be satisfied by one of several models of GPU.

User Stories

Story 1

As a workload author, I want to run a workload that needs a GPU. The workload itself can work with a few different models of GPU, but may need different numbers of them depending on the model chosen. If the latest model is available in my cluster, I would like to use that, but if it is not I am willing to take a model one generation older. If none of those are available, I am willing to take two GPUs of an even older model.

Story 2

As a workload author, I want to distribute the manifests of my workloads online. However, there are many different models of device out there, and so I do not want to be too prescriptive in how I define my manifest. If I make it too detailed, then I will either need multiple versions or the users will have to edit the manifest. Instead, I would like to provide some optionality in the types of devices that can meet my workload's needs. For best performance though, I do have a preferred ordering of devices.

Notes/Constraints/Caveats

Resource Quota

ResourceQuota will be enforced such that the user must have quota for each DeviceRequest under every FirstAvailableOf. Thus, this "pick one" behavior cannot be used to circumvent quota. This reduces the usefulness of the feature, as it means it will not serve as a quota management feature. However, the primary goal of the feature is about flexibility across clusters and obtainability of underlying devices, not quota management.
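The enforcement rule above can be sketched as follows. The `SubRequest` type and `fitsQuota` function are illustrative stand-ins for the intended admission behavior, not the actual ResourceQuota controller code:

```go
package main

import "fmt"

// SubRequest is a minimal stand-in for one alternative inside a
// FirstAvailableOf list; only the fields relevant to quota counting
// are shown.
type SubRequest struct {
	DeviceClassName string
	Count           int64
}

// fitsQuota reports whether every alternative fits within the remaining
// per-device-class quota. A claim is admitted only if all alternatives
// fit, so choosing a cheaper alternative cannot be used to bypass quota.
func fitsQuota(alternatives []SubRequest, remaining map[string]int64) bool {
	for _, sub := range alternatives {
		if remaining[sub.DeviceClassName] < sub.Count {
			return false
		}
	}
	return true
}

func main() {
	remaining := map[string]int64{"big-gpu": 1, "mid-gpu": 1, "small-gpu": 0}
	alternatives := []SubRequest{
		{DeviceClassName: "big-gpu", Count: 1},
		{DeviceClassName: "mid-gpu", Count: 1},
		{DeviceClassName: "small-gpu", Count: 2},
	}
	// Even though the big-gpu alternative alone would fit, the claim is
	// rejected because the small-gpu alternative exceeds remaining quota.
	fmt.Println(fitsQuota(alternatives, remaining)) // false
}
```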

Risks and Mitigations

Design Details

A DeviceRequest that populates the FirstAvailableOf field must not populate the DeviceClassName field. The required validation on this field will be relaxed. This allows existing clients to differentiate between claims they understand (with DeviceClassName) and those they do not (without DeviceClassName but with the new field). Clients written for 1.31, when DeviceClassName was required, were requested to include this logic, and the in-tree components have been built in this way.

// DeviceRequest is a request for devices required for a claim.
// This is typically a request for a single resource like a device, but can
// also ask for several identical devices.
type DeviceRequest struct {
    // Name can be used to reference this request in a pod.spec.containers[].resources.claims
    // entry and in a constraint of the claim.
    //
    // Must be a DNS label.
    //
    // +required
    Name string

    // DeviceClassName references a specific DeviceClass, which can define
    // additional configuration and selectors to be inherited by this
    // request.
    //
    // Exactly one of DeviceClassName or FirstAvailableOf must be set in
    // each entry of DeviceClaim.Requests. When this request appears inside
    // a FirstAvailableOf list, DeviceClassName is required; nested
    // FirstAvailableOf requests are not allowed.
    //
    // Which classes are available depends on the cluster.
    //
    // Administrators may use this to restrict which devices may get
    // requested by only installing classes with selectors for permitted
    // devices. If users are free to request anything without restrictions,
    // then administrators can create an empty DeviceClass for users
    // to reference.
    //
    // +optional
    // +oneOf=deviceRequestType
    DeviceClassName string

    // FirstAvailableOf contains subrequests, exactly one of which must be satisfied
    // in order to satisfy this request. This field may only be set in the
    // entries of DeviceClaim.Requests. It must not be set in DeviceRequest
    // instances that themselves are part of a FirstAvailableOf.
    //
    // +optional
    // +oneOf=deviceRequestType
    FirstAvailableOf []DeviceRequest

    ...
}

const (
    DeviceSelectorsMaxSize               = 32
    FirstAvailableOfDeviceRequestMaxSize = 8
)

Let's take a look at an example.

apiVersion: resource.k8s.io/v1alpha4
kind: ResourceClaim
metadata:
  name: device-consumer-claim
spec:
  devices:
    requests:
    - name: nic
      deviceClassName: rdma-nic
    - name: gpu
      firstAvailableOf:
      - name: big-gpu
        deviceClassName: big-gpu
      - name: mid-gpu
        deviceClassName: mid-gpu
      - name: small-gpu
        deviceClassName: small-gpu
        count: 2
    constraints:
    - requests: ["nic", "gpu"]
      matchAttribute:
      - dra.k8s.io/pcieRoot
    config:
    - requests: ["small-gpu"]
      opaque:
        driver: gpu.acme.example.com
        parameters:
          apiVersion: gpu.acme.example.com/v1
          kind: GPUConfig
          mode: multipleGPUs
---
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  resourceClaims:
  - name: "gpu-and-nic"
    resourceClaimName: device-consumer-claim
  containers:
  - name: workload
    image: my-app
    command: ["/bin/program"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
      claims:
      - name: "gpu-and-nic"
        request: "gpu" # the 'nic' request is pod-level, no need to attach to container

There are a few things to note here. First, the "nic" request is listed with a deviceClassName, because it has no alternative request types. The "gpu" request could be met by several different types of GPU, in the listed order of preference. Each of those is a separate DeviceRequest, with both a deviceClassName and also its own name. The fact that these subrequests also have their own names allows us to apply constraints or configuration to specific, individual subrequests, in the event that it is the chosen alternative. In this example, the "small-gpu" choice requires a configuration option that the other two choices do not need. Thus, if the resolution of the "gpu" request is made using the "small-gpu" subrequest, then that configuration will be attached to the allocation. Otherwise, it will not.

Similarly, for Constraints, the list of requests can include the main request name ("gpu" in this case), in which case the constraint applies regardless of which alternative is chosen. Or, it can include the subrequest name, in which case that constraint only applies if that particular subrequest is chosen.

In the PodSpec, however, the subrequest names are not valid. Only the main request name may be used.
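The name-matching rule for constraints and configuration can be sketched as a small predicate. The `appliesTo` function here is an illustrative sketch of the matching behavior described above, not the actual allocator code:

```go
package main

import "fmt"

// appliesTo reports whether a constraint or config entry listing the
// given request names applies to an allocation in which the main request
// mainName was resolved using the subrequest chosenSub. An entry applies
// if it names the main request (any alternative) or names the specific
// subrequest that was chosen.
func appliesTo(entryRequests []string, mainName, chosenSub string) bool {
	for _, r := range entryRequests {
		if r == mainName || r == chosenSub {
			return true
		}
	}
	return false
}

func main() {
	// Using the names from the example claim above, with the "gpu" request
	// resolved via the "small-gpu" subrequest:
	fmt.Println(appliesTo([]string{"nic", "gpu"}, "gpu", "small-gpu")) // true: names the main request
	fmt.Println(appliesTo([]string{"small-gpu"}, "gpu", "small-gpu"))  // true: names the chosen subrequest
	fmt.Println(appliesTo([]string{"big-gpu"}, "gpu", "small-gpu"))    // false: a different subrequest
}
```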

Scheduler Implementation

Currently, the scheduler loops through each entry in DeviceClaim.Requests and tries to satisfy each one. This would work essentially the same, except that today, it throws an error when it encounters a claim with a missing DeviceClassName. Instead, here we would check for entries in FirstAvailableOf, and add an additional loop, trying each of these requests in order.

The current implementation performs a depth-first search of the devices, trying to satisfy all requests and constraints of all claims. The optionality offered at the DeviceRequest level adds another index to track in the requestIndices and deviceIndices. When the feature gate is disabled, this new index will always be 0.

Alternatively, we can refactor this code so that the feature-gated paths are easier to isolate and maintain.

DRA today works on a "first match" basis for a given node. That would not change with this KEP; on any given node, the alternatives will be tried in the priority order listed in the main request, and the first fit will be returned. However, in practice, nodes typically only have one type of device that would satisfy any of the alternatives. That means that nodes with any of the listed devices will show as valid nodes for the workload. In order for the scheduler to prefer a node that can satisfy a higher-priority alternative, those nodes would need a higher score, which is currently planned for beta of this feature. For alpha, if the cluster contains nodes with each type of device, the scheduler may still pick a node with a less preferred device.
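The per-node allocation flow described above can be sketched as follows. The `Request` type, `allocateOne`, and `satisfy` are simplified stand-ins for the allocator internals (counts, CEL selectors, and backtracking are elided), not the actual scheduler code:

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a minimal stand-in for DeviceRequest: either DeviceClassName
// or FirstAvailableOf is set, never both.
type Request struct {
	Name             string
	DeviceClassName  string
	FirstAvailableOf []Request
}

// allocateOne pretends to allocate one device for a plain request against
// a node's available devices, keyed by device class.
func allocateOne(r Request, available map[string]int) error {
	if available[r.DeviceClassName] > 0 {
		available[r.DeviceClassName]--
		return nil
	}
	return errors.New("no fit")
}

// satisfy resolves a request on a node. Alternatives are tried in the
// listed priority order and the first fit wins. With the feature gate
// disabled, a request without a DeviceClassName is an error, mirroring
// today's behavior.
func satisfy(r Request, available map[string]int, gateEnabled bool) (string, error) {
	if len(r.FirstAvailableOf) == 0 {
		if err := allocateOne(r, available); err != nil {
			return "", err
		}
		return r.Name, nil
	}
	if !gateEnabled {
		return "", fmt.Errorf("request %q: DeviceClassName is required", r.Name)
	}
	for _, sub := range r.FirstAvailableOf {
		if err := allocateOne(sub, available); err == nil {
			return sub.Name, nil
		}
	}
	return "", fmt.Errorf("request %q: no alternative fits on this node", r.Name)
}

func main() {
	// A node that only has small GPUs: the first two alternatives fail
	// and the allocation falls through to "small-gpu".
	node := map[string]int{"small-gpu": 2}
	gpu := Request{
		Name: "gpu",
		FirstAvailableOf: []Request{
			{Name: "big-gpu", DeviceClassName: "big-gpu"},
			{Name: "mid-gpu", DeviceClassName: "mid-gpu"},
			{Name: "small-gpu", DeviceClassName: "small-gpu"},
		},
	}
	chosen, err := satisfy(gpu, node, true)
	fmt.Println(chosen, err) // small-gpu <nil>
}
```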

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

Start of v1.32 development cycle (v1.32.0-alpha.1-178-gd9c46d8ecb1):

  • k8s.io/dynamic-resource-allocation/cel: 88.8%
  • k8s.io/dynamic-resource-allocation/structured: 82.7%
  • k8s.io/kubernetes/pkg/controller/resourceclaim: 70.0%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 72.9%
Integration tests

The existing integration tests for kube-scheduler which measure performance will be extended to cover the overhead of running the additional logic to support the features in this KEP. These also serve as correctness tests as part of the normal Kubernetes "integration" jobs which cover the dynamic resource controller.

e2e tests

End-to-end testing depends on a working resource driver and a container runtime with CDI support. A test driver was developed as part of the overall DRA development effort. We will extend this test driver to enable support for alternative device requests and add tests to ensure they are handled by the scheduler as described in this KEP.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Implemented in the scheduler but not necessarily the cluster auto scaler
  • Initial e2e tests completed and enabled

Beta

  • Gather feedback
  • Implement node scoring
  • Evaluate feasibility of cluster auto scaler implementation
  • Additional tests are in Testgrid and linked in KEP

GA

  • 3 examples of real-world usage
  • Allowing time for feedback

Upgrade / Downgrade Strategy

Standard upgrade/downgrade strategies may be used; no special configuration changes are needed. There are no kubelet or DRA-driver changes for this feature; all changes are local to the control plane.

Version Skew Strategy

The proposed API change relaxes a required constraint on the DeviceRequest.DeviceClassName field. The DeviceRequest thus becomes a one-of that must have either the DeviceClassName or the FirstAvailableOf field populated.

Older clients have been advised in the current implementation to check this field, even though it is required, and fail to allocate a claim that does not have the field set. This means that during rollout, if the API server has this feature, but the scheduler does not, the scheduler will fail to schedule pods that utilize the feature. The pod will be scheduled later according to the new functionality after the scheduler is upgraded.
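The defensive check that older clients were advised to carry can be sketched as follows. The function name and shape are illustrative, not the actual client code:

```go
package main

import "fmt"

// supportedRequest sketches the guard that pre-1.32 clients were asked to
// include: a request with no DeviceClassName uses a request type the
// client does not understand, so it must refuse to allocate rather than
// silently misinterpret the claim.
func supportedRequest(deviceClassName string) error {
	if deviceClassName == "" {
		return fmt.Errorf("request has no device class name; unsupported request type")
	}
	return nil
}

func main() {
	fmt.Println(supportedRequest("big-gpu")) // <nil>
	fmt.Println(supportedRequest(""))
}
```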

This feature affects the specific allocations that get made by the scheduler. Those allocations are stored in the ResourceClaim status, and will be acted upon by the kubelet and DRA-driver just as if the user had made the request without this feature. Thus, there is no impact on the data plane version skew; if the selected request could be satisfied by the data plane without this feature, it will work exactly the same with this feature.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

This is an add-on on top of the DynamicResourceAllocation feature gate, which also must be enabled for this feature to work.

  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: DRAFirstAvailableOf
    • Components depending on the feature gate:
      • kube-apiserver
      • kube-scheduler
      • kube-controller-manager
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. No existing claims or running pods will be affected. This feature affects only the allocation of devices during scheduling.

If a workload controller or Pod uses a ResourceClaimTemplate that includes this feature, it could happen that a new Pod is created and needs to be scheduled even though the feature is disabled. In this case, the new Pod will fail to schedule, as the corresponding ResourceClaim cannot be created.

The recommendation is to remove any usage of this feature in both ResourceClaims and ResourceClaimTemplates when disabling the feature, and force the workloads to use a specific device request instead. This will ensure that there are no unexpected failures later, if a Pod gets rescheduled to another node or recreated for some reason.

What happens if we reenable the feature if it was previously rolled back?

The feature will begin working again for future scheduling choices that make use of it. For Deployments or other users of ResourceClaimTemplate, previously failing Pod creations or scheduling may begin to succeed.

Are there any tests for feature enablement/disablement?

Unit tests will be written to validate the enablement and disablement behavior, as well as type conversions for the new field and relaxed validation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Will consider in the beta timeframe.

What specific metrics should inform a rollback?

Will consider in the beta timeframe.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Will consider in the beta timeframe.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No, though validation on the DeviceClassName field is relaxed so that it is no longer required.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Will consider in the beta timeframe.

How can someone using this feature know that it is working for their instance?

Will consider in the beta timeframe.

  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Existing DRA and related SLOs continue to apply.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Will consider in the beta timeframe.

  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

We can consider a histogram metric showing how many allocations are made from indices 0-7 of ResourceClaims that utilize this feature.

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature depends on the DRA structured parameters feature being enabled, and on DRA drivers being deployed. There are no requirements beyond those already needed for DRA structured parameters.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No, just a new field on the ResourceClaim.DeviceRequest struct.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, when using this field, the user will add additional data in their ResourceClaim and ResourceClaimTemplate objects. This is an incremental increase on top of the existing structures. The number of alternate requests is limited to 8 in order to minimize the potential object size.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Scheduling a claim that uses this feature may take a bit longer, if it is necessary to go deeper into the list of alternative options before finding a suitable device. We can measure this impact in alpha.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

1.32 Enhancements Freeze - KEP merged, alpha implementation initiated

Drawbacks

This adds complexity to the scheduler and to the cluster autoscaler, which must simulate the satisfaction of claims with different node shapes.

Alternatives

Higher Level Indirection

Rather than embedding a list of alternative request objects, we could use an indirection at either the ResourceClaim level, or the DeviceClaim level. For example, we could create a new resource claim type by adding a FirstOfDevices list to the ResourceClaimSpec, and making it a one-of with Devices.

Something like this:

// ResourceClaimSpec defines what is being requested in a ResourceClaim and how to configure it.
type ResourceClaimSpec struct {
        // Devices defines how to request devices.
        //
        // oneOf: claimType
        // +optional
        Devices DeviceClaim

        // FirstOfDevices defines a prioritized list of alternative device
        // claims, exactly one of which is satisfied.
        //
        // oneOf: claimType
        // +optional
        FirstOfDevices []DeviceClaim

        ...
}

This is arguably simpler and allows them to be essentially complete, alternate claims. It would be more difficult for the user, though, as it would require duplication of other device requests. Additionally, if there were multiple separate FirstAvailableOf requests in a claim, the user would have to specify all the combinations of those in order to get the same flexibility.
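The combinatorial cost of the higher-level indirection is easy to quantify. As a hypothetical illustration: a claim with a three-way GPU choice and an independent two-way NIC choice would need 3 * 2 = 6 complete entries in FirstOfDevices, versus two FirstAvailableOf lists in the proposed design:

```go
package main

import "fmt"

// combinations returns the number of complete alternate DeviceClaims
// needed at the top level when each independent request offers the given
// number of alternatives: the product of the list sizes.
func combinations(alternativesPerRequest []int) int {
	total := 1
	for _, n := range alternativesPerRequest {
		total *= n
	}
	return total
}

func main() {
	// Three GPU alternatives and two NIC alternatives.
	fmt.Println(combinations([]int{3, 2})) // 6
}
```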

Infrastructure Needed (Optional)