KEP-4816: DRA: Prioritized Alternatives in Device Requests

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The DRA Structured Parameters feature has added the ability to make requests for very specific types of devices using a ResourceClaim. However, the current API does not allow the user to indicate any priority when multiple types or configurations of devices may meet the needs of the workload. This feature allows the user to specify alternative requests that satisfy the workload's needs, giving the scheduler more flexibility in scheduling the workload.

Motivation

"Obtainability" of certain types of scarce resources is a primary concern of many AI/ML users. GPUs are in high demand, particularly the latest models. This means that workloads that use DRA to specify a need for particular types of GPUs may fail to schedule. In practice, a workload that needs a GPU can be written such that it can discover the GPUs available to it, and work with what it is given. A user may have a preference for the latest model, but would like to run the workload even if only an older model is available.

Similarly, packaged workload authors may wish to configure a workload such that it will work well in the widest selection of available clusters. That is, a distributor of shared workload definitions would like to be able to specify alternative types of devices with which their workload will function, without requiring the user to modify the manifests.

Goals

  • Allow workload authors, when specifying a ResourceClaim, to provide a list of ways to satisfy the claim, with a preference ranking.
  • Enable schedulers to evaluate those preferences and allocate devices for the claim based on them.
  • Enable cluster autoscalers to evaluate those preferences and make scaling choices based on them.
  • Provide some measure of ResourceQuota controls when users utilize claims with these types of requests.

Non-Goals

  • Enable cross-claim consistency of request choices. For example, guaranteeing that all ResourceClaims associated with a given Deployment are satisfied using the same choice from the list of possible alternatives.

Proposal

The ResourceClaim object contains a DeviceClaim, which in turn contains a list of DeviceRequest objects. This allows the user to allocate different types of devices for the same claim, and apply constraints and configuration across those different requests.

The proposal adds a new field to the DeviceRequest, called FirstAvailableOf which will contain an ordered list of DeviceRequest objects. In order to satisfy the main (containing) request, exactly one of the requests listed in FirstAvailableOf must be satisfied. The order listed is considered a priority order, such that the scheduler will only try to use the second item in the list if it is unable to satisfy the first item, and so on.

This allows some flexibility for the user to create, say, a "gpu" request, but allow it to be satisfied by one of several models of GPU.

User Stories

Story 1

As a workload author, I want to run a workload that needs a GPU. The workload itself can work with a few different models of GPU, but may need different numbers of them depending on the model chosen. If the latest model is available in my cluster, I would like to use that, but if it is not I am willing to take a model one generation older. If none of those are available, I am willing to take two GPUs of an even older model.

Story 2

As a workload author, I want to distribute the manifests of my workloads online. However, there are many different models of device out there, and so I do not want to be too prescriptive in how I define my manifest. If I make it too detailed, then I will either need multiple versions or the users will have to edit the manifest. Instead, I would like to provide some optionality in the types of devices that can meet my workload's needs. For best performance though, I do have a preferred ordering of devices.

Notes/Constraints/Caveats

Resource Quota

ResourceQuota will be enforced such that the user must have quota for each DeviceRequest under every FirstAvailableOf. Thus, this "pick one" behavior cannot be used to circumvent quota. This reduces the usefulness of the feature, as it means it will not serve as a quota management feature. However, the primary goal of the feature is about flexibility across clusters and obtainability of underlying devices, not quota management.
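The enforcement rule above can be sketched as follows. The `SubRequest` type and `fitsQuota` function are illustrative stand-ins for the intended admission behavior, not the actual ResourceQuota controller code:

```go
package main

import "fmt"

// SubRequest is a minimal stand-in for one alternative inside a
// FirstAvailableOf list; only the fields relevant to quota counting
// are shown.
type SubRequest struct {
	DeviceClassName string
	Count           int64
}

// fitsQuota reports whether every alternative fits within the remaining
// per-device-class quota. A claim is admitted only if all alternatives
// fit, so choosing a cheaper alternative cannot be used to bypass quota.
func fitsQuota(alternatives []SubRequest, remaining map[string]int64) bool {
	for _, sub := range alternatives {
		if remaining[sub.DeviceClassName] < sub.Count {
			return false
		}
	}
	return true
}

func main() {
	remaining := map[string]int64{"big-gpu": 1, "mid-gpu": 1, "small-gpu": 0}
	alternatives := []SubRequest{
		{DeviceClassName: "big-gpu", Count: 1},
		{DeviceClassName: "mid-gpu", Count: 1},
		{DeviceClassName: "small-gpu", Count: 2},
	}
	// Even though the big-gpu alternative alone would fit, the claim is
	// rejected because the small-gpu alternative exceeds remaining quota.
	fmt.Println(fitsQuota(alternatives, remaining)) // false
}
```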

Risks and Mitigations

Design Details

A DeviceRequest that populates the FirstAvailableOf field must not populate the DeviceClassName field. The required validation on this field will be relaxed. This allows existing clients to differentiate between claims they understand (with DeviceClassName) and those they do not (without DeviceClassName but with the new field). Clients written for 1.31, when DeviceClassName was required, were requested to include this logic, and the in-tree components have been built in this way.

// DeviceRequest is a request for devices required for a claim.
// This is typically a request for a single resource like a device, but can
// also ask for several identical devices.
type DeviceRequest struct {
    // Name can be used to reference this request in a pod.spec.containers[].resources.claims
    // entry and in a constraint of the claim.
    //
    // Must be a DNS label.
    //
    // +required
    Name string

    // DeviceClassName references a specific DeviceClass, which can define
    // additional configuration and selectors to be inherited by this
    // request.
    //
    // Exactly one of DeviceClassName or FirstAvailableOf must be set in
    // each entry of DeviceClaim.Requests. When this request appears inside
    // a FirstAvailableOf list, DeviceClassName is required; nested
    // FirstAvailableOf requests are not allowed.
    //
    // Which classes are available depends on the cluster.
    //
    // Administrators may use this to restrict which devices may get
    // requested by only installing classes with selectors for permitted
    // devices. If users are free to request anything without restrictions,
    // then administrators can create an empty DeviceClass for users
    // to reference.
    //
    // +optional
    // +oneOf=deviceRequestType
    DeviceClassName string

    // FirstAvailableOf contains subrequests, exactly one of which must be satisfied
    // in order to satisfy this request. This field may only be set in the
    // entries of DeviceClaim.Requests. It must not be set in DeviceRequest
    // instances that themselves are part of a FirstAvailableOf.
    //
    // +optional
    // +oneOf=deviceRequestType
    FirstAvailableOf []DeviceRequest

    ...
}

const (
    DeviceSelectorsMaxSize               = 32
    FirstAvailableOfDeviceRequestMaxSize = 8
)

Let's take a look at an example.

apiVersion: resource.k8s.io/v1alpha4
kind: ResourceClaim
metadata:
  name: device-consumer-claim
spec:
  devices:
    requests:
    - name: nic
      deviceClassName: rdma-nic
    - name: gpu
      firstAvailableOf:
      - name: big-gpu
        deviceClassName: big-gpu
      - name: mid-gpu
        deviceClassName: mid-gpu
      - name: small-gpu
        deviceClassName: small-gpu
        count: 2
    constraints:
    - requests: ["nic", "gpu"]
      matchAttribute:
      - dra.k8s.io/pcieRoot
    config:
    - requests: ["small-gpu"]
      opaque:
        driver: gpu.acme.example.com
        parameters:
          apiVersion: gpu.acme.example.com/v1
          kind: GPUConfig
          mode: multipleGPUs
---
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  resourceClaims:
  - name: "gpu-and-nic"
    resourceClaimName: device-consumer-claim
  containers:
  - name: workload
    image: my-app
    command: ["/bin/program"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
      claims:
      - name: "gpu-and-nic"
        request: "gpu" # the 'nic' request is pod-level, no need to attach to container

There are a few things to note here. First, the "nic" request is listed with a deviceClassName, because it has no alternative request types. The "gpu" request could be met by several different types of GPU, in the listed order of preference. Each of those is a separate DeviceRequest, with both a deviceClassName and also its own name. The fact that these subrequests also have their own names allows us to apply constraints or configuration to specific, individual subrequests, in the event that it is the chosen alternative. In this example, the "small-gpu" choice requires a configuration option that the other two choices do not need. Thus, if the resolution of the "gpu" request is made using the "small-gpu" subrequest, then that configuration will be attached to the allocation. Otherwise, it will not.

Similarly, for Constraints, the list of requests can include the main request name ("gpu" in this case), in which case the constraint applies regardless of which alternative is chosen. Or, it can include the subrequest name, in which case that constraint only applies if that particular subrequest is chosen.

In the PodSpec, however, the subrequest names are not valid. Only the main request name may be used.
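The name-matching rule for constraints and configuration can be sketched as a small predicate. The `appliesTo` function here is an illustrative sketch of the matching behavior described above, not the actual allocator code:

```go
package main

import "fmt"

// appliesTo reports whether a constraint or config entry listing the
// given request names applies to an allocation in which the main request
// mainName was resolved using the subrequest chosenSub. An entry applies
// if it names the main request (any alternative) or names the specific
// subrequest that was chosen.
func appliesTo(entryRequests []string, mainName, chosenSub string) bool {
	for _, r := range entryRequests {
		if r == mainName || r == chosenSub {
			return true
		}
	}
	return false
}

func main() {
	// Using the names from the example claim above, with the "gpu" request
	// resolved via the "small-gpu" subrequest:
	fmt.Println(appliesTo([]string{"nic", "gpu"}, "gpu", "small-gpu")) // true: names the main request
	fmt.Println(appliesTo([]string{"small-gpu"}, "gpu", "small-gpu"))  // true: names the chosen subrequest
	fmt.Println(appliesTo([]string{"big-gpu"}, "gpu", "small-gpu"))    // false: a different subrequest
}
```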

Scheduler Implementation

Currently, the scheduler loops through each entry in DeviceClaim.Requests and tries to satisfy each one. This would work essentially the same, except that today, it throws an error when it encounters a claim with a missing DeviceClassName. Instead, here we would check for entries in FirstAvailableOf, and add an additional loop, trying each of these requests in order.

The current implementation performs a depth-first search of the devices, trying to satisfy all requests and constraints of all claims. The optionality offered at the DeviceRequest level adds another index to track in the requestIndices and deviceIndices. When the feature gate is disabled, this new index will always be 0.

Alternatively, we can refactor this code so that the feature-gated paths are easier to isolate and maintain.

DRA today works on a "first match" basis for a given node. That would not change with this KEP; on any given node, the alternatives will be tried in the priority order listed in the main request, and the first fit will be returned. However, in practice, nodes typically only have one type of device that would satisfy any of the alternatives. That means that nodes with any of the listed devices will show as valid nodes for the workload. In order for the scheduler to prefer a node that can satisfy a higher-priority alternative, those nodes would need a higher score, which is currently planned for beta of this feature. For alpha, if the cluster contains nodes with each type of device, the scheduler may still pick a node with a less preferred device.
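The per-node allocation flow described above can be sketched as follows. The `Request` type, `allocateOne`, and `satisfy` are simplified stand-ins for the allocator internals (counts, CEL selectors, and backtracking are elided), not the actual scheduler code:

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a minimal stand-in for DeviceRequest: either DeviceClassName
// or FirstAvailableOf is set, never both.
type Request struct {
	Name             string
	DeviceClassName  string
	FirstAvailableOf []Request
}

// allocateOne pretends to allocate one device for a plain request against
// a node's available devices, keyed by device class.
func allocateOne(r Request, available map[string]int) error {
	if available[r.DeviceClassName] > 0 {
		available[r.DeviceClassName]--
		return nil
	}
	return errors.New("no fit")
}

// satisfy resolves a request on a node. Alternatives are tried in the
// listed priority order and the first fit wins. With the feature gate
// disabled, a request without a DeviceClassName is an error, mirroring
// today's behavior.
func satisfy(r Request, available map[string]int, gateEnabled bool) (string, error) {
	if len(r.FirstAvailableOf) == 0 {
		if err := allocateOne(r, available); err != nil {
			return "", err
		}
		return r.Name, nil
	}
	if !gateEnabled {
		return "", fmt.Errorf("request %q: DeviceClassName is required", r.Name)
	}
	for _, sub := range r.FirstAvailableOf {
		if err := allocateOne(sub, available); err == nil {
			return sub.Name, nil
		}
	}
	return "", fmt.Errorf("request %q: no alternative fits on this node", r.Name)
}

func main() {
	// A node that only has small GPUs: the first two alternatives fail
	// and the allocation falls through to "small-gpu".
	node := map[string]int{"small-gpu": 2}
	gpu := Request{
		Name: "gpu",
		FirstAvailableOf: []Request{
			{Name: "big-gpu", DeviceClassName: "big-gpu"},
			{Name: "mid-gpu", DeviceClassName: "mid-gpu"},
			{Name: "small-gpu", DeviceClassName: "small-gpu"},
		},
	}
	chosen, err := satisfy(gpu, node, true)
	fmt.Println(chosen, err) // small-gpu <nil>
}
```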

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

Start of v1.32 development cycle (v1.32.0-alpha.1-178-gd9c46d8ecb1):

  • k8s.io/dynamic-resource-allocation/cel: 88.8%
  • k8s.io/dynamic-resource-allocation/structured: 82.7%
  • k8s.io/kubernetes/pkg/controller/resourceclaim: 70.0%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 72.9%
Integration tests

The existing integration tests for kube-scheduler which measure performance will be extended to cover the overhead of running the additional logic to support the features in this KEP. These also serve as correctness tests as part of the normal Kubernetes "integration" jobs which cover the dynamic resource controller.

e2e tests

End-to-end testing depends on a working resource driver and a container runtime with CDI support. A test driver was developed as part of the overall DRA development effort. We will extend this test driver to enable support for alternative device requests and add tests to ensure they are handled by the scheduler as described in this KEP.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Implemented in the scheduler but not necessarily the cluster auto scaler
  • Initial e2e tests completed and enabled

Beta

  • Gather feedback
  • Implement node scoring
  • Evaluate feasibility of cluster auto scaler implementation
  • Additional tests are in Testgrid and linked in KEP

GA

  • 3 examples of real-world usage
  • Allowing time for feedback

Upgrade / Downgrade Strategy

Standard upgrade/downgrade strategies may be used; no special configuration changes are needed. There are no kubelet or DRA-driver changes for this feature; all changes are local to the control plane.

Version Skew Strategy

The proposed API change relaxes a required constraint on the DeviceRequest.DeviceClassName field. The DeviceRequest thus becomes a one-of that must have either the DeviceClassName or the FirstAvailableOf field populated.

Older clients have been advised in the current implementation to check this field, even though it is required, and fail to allocate a claim that does not have the field set. This means that during rollout, if the API server has this feature, but the scheduler does not, the scheduler will fail to schedule pods that utilize the feature. The pod will be scheduled later according to the new functionality after the scheduler is upgraded.
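The defensive check that older clients were advised to carry can be sketched as follows. The function name and shape are illustrative, not the actual client code:

```go
package main

import "fmt"

// supportedRequest sketches the guard that pre-1.32 clients were asked to
// include: a request with no DeviceClassName uses a request type the
// client does not understand, so it must refuse to allocate rather than
// silently misinterpret the claim.
func supportedRequest(deviceClassName string) error {
	if deviceClassName == "" {
		return fmt.Errorf("request has no device class name; unsupported request type")
	}
	return nil
}

func main() {
	fmt.Println(supportedRequest("big-gpu")) // <nil>
	fmt.Println(supportedRequest(""))
}
```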

This feature affects the specific allocations that get made by the scheduler. Those allocations are stored in the ResourceClaim status, and will be acted upon by the kubelet and DRA-driver just as if the user had made the request without this feature. Thus, there is no impact on the data plane version skew; if the selected request could be satisfied by the data plane without this feature, it will work exactly the same with this feature.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

This is an add-on on top of the DynamicResourceAllocation feature gate, which also must be enabled for this feature to work.

  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: DRAFirstAvailableOf
    • Components depending on the feature gate:
      • kube-apiserver
      • kube-scheduler
      • kube-controller-manager
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. No existing claims or running pods will be affected. This feature affects only the allocation of devices during scheduling.

If a workload controller or Pod uses a ResourceClaimTemplate that includes this feature, it could happen that a new Pod is created and needs to be scheduled even though the feature is disabled. In this case, the new Pod will fail to schedule, as the corresponding ResourceClaim cannot be created.

The recommendation is to remove any usage of this feature in both ResourceClaims and ResourceClaimTemplates when disabling the feature, and force the workloads to use a specific device request instead. This will ensure that there are no unexpected failures later, if a Pod gets rescheduled to another node or recreated for some reason.

What happens if we reenable the feature if it was previously rolled back?

The feature will begin working again for future scheduling choices that make use of it. For Deployments or other users of ResourceClaimTemplate, previously failing Pod creations or scheduling may begin to succeed.

Are there any tests for feature enablement/disablement?

Unit tests will be written to validate the enablement and disablement behavior, as well as type conversions for the new field and relaxed validation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Will consider in the beta timeframe.

What specific metrics should inform a rollback?

Will consider in the beta timeframe.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Will consider in the beta timeframe.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No, though validation on the DeviceClassName field is relaxed so that it is no longer required.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Will consider in the beta timeframe.

How can someone using this feature know that it is working for their instance?

Will consider in the beta timeframe.

  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Existing DRA and related SLOs continue to apply.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Will consider in the beta timeframe.

  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

We can consider a histogram metric showing how many allocations are made from indices 0-7 of ResourceClaims that utilize this feature.

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature depends on the DRA structured parameters feature being enabled, and on DRA drivers being deployed. There are no requirements beyond those already needed for DRA structured parameters.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No, just a new field on the ResourceClaim.DeviceRequest struct.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, when using this field, the user will add additional data in their ResourceClaim and ResourceClaimTemplate objects. This is an incremental increase on top of the existing structures. The number of alternate requests is limited to 8 in order to minimize the potential object size.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Scheduling a claim that uses this feature may take a bit longer, if it is necessary to go deeper into the list of alternative options before finding a suitable device. We can measure this impact in alpha.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

1.32 Enhancements Freeze - KEP merged, alpha implementation initiated

Drawbacks

This adds complexity to the scheduler and to the cluster autoscaler, which must simulate the satisfaction of claims with different node shapes.

Alternatives

Higher Level Indirection

Rather than embedding a list of alternative request objects, we could use an indirection at either the ResourceClaim level, or the DeviceClaim level. For example, we could create a new resource claim type by adding a FirstOfDevices list to the ResourceClaimSpec, and making it a one-of with Devices.

Something like this:

// ResourceClaimSpec defines what is being requested in a ResourceClaim and how to configure it.
type ResourceClaimSpec struct {
        // Devices defines how to request devices.
        //
        // oneOf: claimType
        // +optional
        Devices DeviceClaim

        // FirstOfDevices defines a prioritized list of alternative device
        // claims, exactly one of which is satisfied.
        //
        // oneOf: claimType
        // +optional
        FirstOfDevices []DeviceClaim

        ...
}

This is arguably simpler and allows them to be essentially complete, alternate claims. It would be more difficult for the user, though, as it would require duplication of other device requests. Additionally, if there were multiple separate FirstAvailableOf requests in a claim, the user would have to specify all the combinations of those in order to get the same flexibility.
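The combinatorial cost of the higher-level indirection is easy to quantify. As a hypothetical illustration: a claim with a three-way GPU choice and an independent two-way NIC choice would need 3 * 2 = 6 complete entries in FirstOfDevices, versus two FirstAvailableOf lists in the proposed design:

```go
package main

import "fmt"

// combinations returns the number of complete alternate DeviceClaims
// needed at the top level when each independent request offers the given
// number of alternatives: the product of the list sizes.
func combinations(alternativesPerRequest []int) int {
	total := 1
	for _, n := range alternativesPerRequest {
		total *= n
	}
	return total
}

func main() {
	// Three GPU alternatives and two NIC alternatives.
	fmt.Println(combinations([]int{3, 2})) // 6
}
```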

Infrastructure Needed (Optional)