KEP-4816: DRA: Prioritized Alternatives in Device Requests
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The DRA Structured Parameters feature added the ability to make requests for very specific types of devices using a `ResourceClaim`. However, the current API does not allow the user to indicate any priority when multiple types or configurations of devices may meet the needs of the workload. This feature allows the user to specify alternative requests that satisfy the workload's needs, giving the scheduler more flexibility in scheduling the workload.
"Obtainability" of certain types of scarce resources is a primary concern of many AI/ML users. GPUs are in high demand, particularly the latest models. This means that workloads that use DRA to specify a need for particular types of GPUs may fail to schedule. In practice, a workload that needs a GPU can be written such that it can discover the GPUs available to it, and work with what it is given. A user may have a preference for the latest model, but would like to run the workload even if only an older model is available.
Similarly, packaged workload authors may wish to configure a workload such that it will work well in the widest selection of available clusters. That is, a distributor of shared workload definitions would like to be able to specify alternative types of devices with which their workload will function, without requiring the user to modify the manifests.
- Allow workload authors, when specifying a `ResourceClaim`, to provide a list of ways to satisfy the claim, with a preference ranking.
- Enable schedulers to evaluate those preferences and allocate devices for the claim based on them.
- Enable cluster autoscalers to evaluate those preferences and make scaling choices based on them.
- Provide some measure of ResourceQuota controls when users utilize claims with these types of requests.
- Enable cross-claim consistency of request choices. For example, guaranteeing that all `ResourceClaim`s associated with a given `Deployment` are satisfied using the same choice from the list of possible alternatives.
The `ResourceClaim` object contains a `DeviceClaim`, which in turn contains a list of `DeviceRequest` objects. This allows the user to allocate different types of devices for the same claim, and to apply constraints and configuration across those different requests.

The proposal adds a new field to the `DeviceRequest`, called `FirstAvailableOf`, which contains an ordered list of `DeviceRequest` objects. In order to satisfy the main (containing) request, exactly one of the requests listed in `FirstAvailableOf` must be satisfied. The listed order is treated as a priority order: the scheduler will only try the second item in the list if it is unable to satisfy the first item, and so on.

This gives the user some flexibility to create, say, a "gpu" request that can be satisfied by one of several models of GPU.
As a workload author, I want to run a workload that needs a GPU. The workload itself can work with a few different models of GPU, but may need different numbers of them depending on the model chosen. If the latest model is available in my cluster, I would like to use that, but if it is not, I am willing to take a model one generation older. If none of those are available, I am willing to take two GPUs of an even older model.
As a workload author, I want to distribute the manifests of my workloads online. However, there are many different models of device out there, and so I do not want to be too prescriptive in how I define my manifest. If I make it too detailed, then I will either need multiple versions or the users will have to edit the manifest. Instead, I would like to provide some optionality in the types of devices that can meet my workload's needs. For best performance though, I do have a preferred ordering of devices.
ResourceQuota will be enforced such that the user must have quota for each `DeviceRequest` under every `FirstAvailableOf`. Thus, this "pick one" behavior cannot be used to circumvent quota. This reduces the usefulness of the feature, as it means it will not serve as a quota management feature. However, the primary goal of the feature is about flexibility across clusters and obtainability of underlying devices, not quota management.
A `DeviceRequest` that populates the `FirstAvailableOf` field must not populate the `DeviceClassName` field. The `required` validation on that field will be relaxed. This allows existing clients to differentiate between claims they understand (with `DeviceClassName`) and those they do not (without `DeviceClassName` but with the new field). Clients written for 1.31, when `DeviceClassName` was required, were requested to include this logic, and the in-tree components have been built in this way.
```go
// DeviceRequest is a request for devices required for a claim.
// This is typically a request for a single resource like a device, but can
// also ask for several identical devices.
type DeviceRequest struct {
	// Name can be used to reference this request in a pod.spec.containers[].resources.claims
	// entry and in a constraint of the claim.
	//
	// Must be a DNS label.
	//
	// +required
	Name string

	// DeviceClassName references a specific DeviceClass, which can define
	// additional configuration and selectors to be inherited by this
	// request.
	//
	// Exactly one of DeviceClassName or FirstAvailableOf must be set in the
	// entries of DeviceClaim.Requests. When this request is part of a
	// FirstAvailableOf list, a class is required. Nested FirstAvailableOf
	// requests are not allowed.
	//
	// Which classes are available depends on the cluster.
	//
	// Administrators may use this to restrict which devices may get
	// requested by only installing classes with selectors for permitted
	// devices. If users are free to request anything without restrictions,
	// then administrators can create an empty DeviceClass for users
	// to reference.
	//
	// +optional
	// +oneOf=deviceRequestType
	DeviceClassName string

	// FirstAvailableOf contains subrequests, exactly one of which must be satisfied
	// in order to satisfy this request. This field may only be set in the
	// entries of DeviceClaim.Requests. It must not be set in DeviceRequest
	// instances that themselves are part of a FirstAvailableOf.
	//
	// +optional
	// +oneOf=deviceRequestType
	FirstAvailableOf []DeviceRequest

	...
}

const (
	DeviceSelectorsMaxSize               = 32
	FirstAvailableOfDeviceRequestMaxSize = 8
)
```
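Before looking at a full example, the one-of behavior described above could be enforced in validation roughly as sketched below. This uses simplified local types rather than the real API and validation packages; the helper name `validateDeviceRequest` and the exact error messages are illustrative, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceRequest is a simplified stand-in for the real API type, carrying
// only the fields relevant to the one-of validation sketched here.
type DeviceRequest struct {
	Name             string
	DeviceClassName  string
	FirstAvailableOf []DeviceRequest
}

const firstAvailableOfMaxSize = 8

// validateDeviceRequest sketches the relaxed validation: top-level requests
// must set exactly one of DeviceClassName or FirstAvailableOf, while
// subrequests must set a class and must not nest further alternatives.
func validateDeviceRequest(r DeviceRequest, isSubRequest bool) error {
	hasClass := r.DeviceClassName != ""
	hasAlternatives := len(r.FirstAvailableOf) > 0

	if isSubRequest {
		if !hasClass {
			return errors.New("subrequests must set deviceClassName")
		}
		if hasAlternatives {
			return errors.New("nested firstAvailableOf is not allowed")
		}
		return nil
	}

	switch {
	case hasClass && hasAlternatives:
		return errors.New("deviceClassName and firstAvailableOf are mutually exclusive")
	case !hasClass && !hasAlternatives:
		return errors.New("either deviceClassName or firstAvailableOf must be set")
	case hasAlternatives && len(r.FirstAvailableOf) > firstAvailableOfMaxSize:
		return fmt.Errorf("firstAvailableOf may have at most %d entries", firstAvailableOfMaxSize)
	}

	for _, sub := range r.FirstAvailableOf {
		if err := validateDeviceRequest(sub, true); err != nil {
			return fmt.Errorf("subrequest %q: %w", sub.Name, err)
		}
	}
	return nil
}

func main() {
	req := DeviceRequest{
		Name: "gpu",
		FirstAvailableOf: []DeviceRequest{
			{Name: "big-gpu", DeviceClassName: "big-gpu"},
			{Name: "small-gpu", DeviceClassName: "small-gpu"},
		},
	}
	fmt.Println(validateDeviceRequest(req, false)) // prints <nil>
}
```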
Let's take a look at an example.
```yaml
apiVersion: resource.k8s.io/v1alpha4
kind: ResourceClaim
metadata:
  name: device-consumer-claim
spec:
  devices:
    requests:
    - name: nic
      deviceClassName: rdma-nic
    - name: gpu
      firstAvailableOf:
      - name: big-gpu
        deviceClassName: big-gpu
      - name: mid-gpu
        deviceClassName: mid-gpu
      - name: small-gpu
        deviceClassName: small-gpu
        count: 2
    constraints:
    - requests: ["nic", "gpu"]
      matchAttribute:
      - dra.k8s.io/pcieRoot
    config:
    - requests: ["small-gpu"]
      opaque:
        driver: gpu.acme.example.com
        parameters:
          apiVersion: gpu.acme.example.com/v1
          kind: GPUConfig
          mode: multipleGPUs
---
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
spec:
  resourceClaims:
  - name: "gpu-and-nic"
    resourceClaimName: device-consumer-claim
  containers:
  - name: workload
    image: my-app
    command: ["/bin/program"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
      claims:
      - name: "gpu-and-nic"
        request: "gpu" # the 'nic' request is pod-level, no need to attach to container
```
There are a few things to note here. First, the "nic" request is listed with a `deviceClassName`, because it has no alternative request types. The "gpu" request could be met by several different types of GPU, in the listed order of preference. Each of those is a separate `DeviceRequest`, with both a `deviceClassName` and its own name. The fact that these subrequests have their own names allows us to apply constraints or configuration to a specific, individual subrequest, in the event that it is the chosen alternative. In this example, the "small-gpu" choice requires a configuration option that the other two choices do not need. Thus, if the "gpu" request is resolved using the "small-gpu" subrequest, that configuration will be attached to the allocation; otherwise, it will not.
Similarly, for `Constraints`, the list of requests can include the main request name ("gpu" in this case), in which case the constraint applies regardless of which alternative is chosen. Or it can include a subrequest name, in which case that constraint only applies if that particular subrequest is chosen.
In the PodSpec, however, the subrequest names are not valid. Only the main request name may be used.
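To make that name resolution concrete, here is a small, purely illustrative sketch of deciding whether a constraints or config entry applies once a particular subrequest has been chosen. The `constraintApplies` helper is hypothetical and not part of the implementation.

```go
package main

import "fmt"

// constraintApplies reports whether a constraint (or config) entry that lists
// the given request names applies to an allocation in which mainRequest was
// satisfied via chosenSubRequest. Listing the main request name covers every
// alternative; listing a subrequest name only covers that specific choice.
func constraintApplies(listedRequests []string, mainRequest, chosenSubRequest string) bool {
	for _, name := range listedRequests {
		if name == mainRequest || name == chosenSubRequest {
			return true
		}
	}
	return false
}

func main() {
	// The matchAttribute constraint lists "gpu", so it applies no matter
	// which alternative was chosen.
	fmt.Println(constraintApplies([]string{"nic", "gpu"}, "gpu", "big-gpu")) // true

	// The opaque config lists only "small-gpu", so it applies only when the
	// "small-gpu" subrequest is the one that was satisfied.
	fmt.Println(constraintApplies([]string{"small-gpu"}, "gpu", "big-gpu"))   // false
	fmt.Println(constraintApplies([]string{"small-gpu"}, "gpu", "small-gpu")) // true
}
```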
Currently, the scheduler loops through each entry in `DeviceClaim.Requests` and tries to satisfy each one. This would work essentially the same, except that today it throws an error when it encounters a request with a missing `DeviceClassName`. Instead, we would check for entries in `FirstAvailableOf` and add an additional loop, trying each of those requests in order.
The current implementation performs a depth-first search of the devices, trying to satisfy all requests and constraints of all claims. The optionality offered at the `DeviceRequest` level adds another index to track in the `requestIndices` and `deviceIndices`. When the feature gate is disabled, this new index will always be 0.
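As a rough illustration of that extra loop and index, the sketch below uses simplified stand-in types and hypothetical function names, not the real allocator code in `k8s.io/dynamic-resource-allocation/structured`:

```go
package main

import "fmt"

// request is a simplified stand-in for a DeviceRequest.
type request struct {
	Name             string
	DeviceClassName  string
	FirstAvailableOf []request
}

// tryAllocate is a placeholder for the existing per-request allocation logic.
func tryAllocate(r request) bool {
	// In the real code this walks candidate devices, checks selectors,
	// constraints, etc. Here we pretend only "small-gpu" is obtainable.
	return r.DeviceClassName == "small-gpu"
}

// allocateRequest sketches the additional loop: if FirstAvailableOf is set,
// try each subrequest in priority order and report which index satisfied it.
// Without the feature (or without alternatives) the index is always 0.
func allocateRequest(r request) (chosenIndex int, ok bool) {
	if len(r.FirstAvailableOf) == 0 {
		return 0, tryAllocate(r)
	}
	for i, sub := range r.FirstAvailableOf {
		if tryAllocate(sub) {
			return i, true
		}
	}
	return 0, false
}

func main() {
	gpu := request{
		Name: "gpu",
		FirstAvailableOf: []request{
			{Name: "big-gpu", DeviceClassName: "big-gpu"},
			{Name: "mid-gpu", DeviceClassName: "mid-gpu"},
			{Name: "small-gpu", DeviceClassName: "small-gpu"},
		},
	}
	idx, ok := allocateRequest(gpu)
	fmt.Println(idx, ok) // 2 true: the third alternative was the one satisfied
}
```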
Alternatively, we can refactor this code to make it easier to guard behind the feature gate.
DRA today works on a "first match" basis for a given node. That would not change with this KEP; on any given node, devices will be tried in the priority order listed in the main request, and the first fit will be returned. However, in practice, nodes typically only have one type of device that would satisfy any of the three requests. That means that individual nodes with any of the listed devices will show as valid nodes for the workload. In order for the scheduler to prefer a node that has the initial prioritized device request, those requests would need a higher score, which currently is planned for beta of this feature. For alpha, the scheduler may still pick a node with a less preferred device, if there are nodes with each type of device available.
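Purely as an illustration of how such node scoring might be shaped once implemented in beta, the function below maps the chosen alternative's index to a score; it is hypothetical and not part of this proposal's API.

```go
package main

import "fmt"

// scoreForChoice sketches one way a node score could reflect how high up the
// FirstAvailableOf list an allocation landed on that node: index 0 (the most
// preferred alternative) maps to the maximum score, later indices score lower.
// maxScore mirrors the scheduler framework's usual 0-100 score range.
func scoreForChoice(chosenIndex, numAlternatives int, maxScore int64) int64 {
	if numAlternatives <= 1 {
		return maxScore
	}
	if chosenIndex >= numAlternatives {
		chosenIndex = numAlternatives - 1
	}
	// Linear fall-off; real scoring might weight earlier choices more strongly.
	step := maxScore / int64(numAlternatives)
	return maxScore - int64(chosenIndex)*step
}

func main() {
	// With three alternatives, a node satisfying the first choice scores 100,
	// the second 67, the third 34.
	for i := 0; i < 3; i++ {
		fmt.Println(scoreForChoice(i, 3, 100))
	}
}
```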
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Start of v1.32 development cycle (v1.32.0-alpha.1-178-gd9c46d8ecb1):
- `k8s.io/dynamic-resource-allocation/cel`: 88.8%
- `k8s.io/dynamic-resource-allocation/structured`: 82.7%
- `k8s.io/kubernetes/pkg/controller/resourceclaim`: 70.0%
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: 72.9%
The existing integration tests for kube-scheduler which measure performance will be extended to cover the overhead of running the additional logic to support the features in this KEP. These also serve as correctness tests as part of the normal Kubernetes "integration" jobs which cover the dynamic resource controller.
End-to-end testing depends on a working resource driver and a container runtime with CDI support. A test driver was developed as part of the overall DRA development effort. We will extend this test driver to enable support for alternative device requests and add tests to ensure they are handled by the scheduler as described in this KEP.
- Feature implemented behind a feature flag
- Implemented in the scheduler but not necessarily the cluster auto scaler
- Initial e2e tests completed and enabled
- Gather feedback
- Implement node scoring
- Evaluate feasibility of cluster autoscaler implementation
- Additional tests are in Testgrid and linked in KEP
- 3 examples of real-world usage
- Allowing time for feedback
Standard upgrade/downgrade strategies may be used; no special configuration changes are needed. There are no kubelet or DRA-driver changes for this feature; the changes are all local to the control plane.
The proposed API change relaxes the `required` constraint on the `DeviceRequest.DeviceClassName` field. The `DeviceRequest` thus becomes a one-of that must have either the `DeviceClassName` or the `FirstAvailableOf` field populated.
Older clients have been advised in the current implementation to check this field, even though it is required, and fail to allocate a claim that does not have the field set. This means that during rollout, if the API server has this feature, but the scheduler does not, the scheduler will fail to schedule pods that utilize the feature. The pod will be scheduled later according to the new functionality after the scheduler is upgraded.
This feature affects the specific allocations that get made by the scheduler. Those allocations are stored in the `ResourceClaim` status and will be acted upon by the kubelet and DRA driver just as if the user had made the request without this feature. Thus, there is no impact on data plane version skew: if the selected request could be satisfied by the data plane without this feature, it will work exactly the same with this feature.
This is an add-on on top of the `DynamicResourceAllocation` feature gate, which must also be enabled for this feature to work.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: DRAFirstAvailableOf
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-scheduler
    - kube-controller-manager
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
No.
Yes. No existing claims or running pods will be affected. This feature affects only the allocation of devices during scheduling.
If a workload controller or Pod uses a `ResourceClaimTemplate` that includes this feature, it could happen that a new Pod is created and needs to be scheduled even though the feature is disabled. In this case, the new Pod will fail to schedule, as the corresponding `ResourceClaim` cannot be created.
The recommendation is to remove any usage of this feature in both `ResourceClaim`s and `ResourceClaimTemplate`s when disabling the feature, and force the workloads to use a specific device request instead. This will ensure that there are no unexpected failures later if a Pod gets rescheduled to another node or recreated for some reason.
The feature will begin working again for future scheduling choices that make use of it. For `Deployment`s or other users of `ResourceClaimTemplate`, previously failing Pod creations or scheduling may begin to succeed.
Unit tests will be written to validate the enablement and disablement behavior, as well as type conversions for the new field and relaxed validation.
Will consider in the beta timeframe.
Will consider in the beta timeframe.
Will consider in the beta timeframe.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No, though we do relax validation on one field to make it no longer a required field.
Will consider in the beta timeframe.
Will consider in the beta timeframe.
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:
Existing DRA and related SLOs continue to apply.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Will consider in the beta timeframe.
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
We can consider a histogram metric showing how many allocations are made from each index (0-7) of the `FirstAvailableOf` list in ResourceClaims that utilize this feature.
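If such a metric is added, it might look roughly like the following sketch using the Prometheus client library. The metric name and registration point are hypothetical; the real implementation would use the Kubernetes component-base metrics wrappers.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// firstAvailableOfChosenIndex would record, for each allocation of a request
// that uses firstAvailableOf, the index of the alternative that was chosen
// (0 = most preferred). Bucket boundaries cover the 0-7 range allowed by
// FirstAvailableOfDeviceRequestMaxSize.
var firstAvailableOfChosenIndex = prometheus.NewHistogram(prometheus.HistogramOpts{
	Subsystem: "scheduler",
	Name:      "dra_first_available_of_chosen_index",
	Help:      "Index of the firstAvailableOf alternative chosen during allocation.",
	Buckets:   prometheus.LinearBuckets(0, 1, 8),
})

func init() {
	prometheus.MustRegister(firstAvailableOfChosenIndex)
}

// recordChoice would be called by the allocator after resolving a request.
func recordChoice(chosenIndex int) {
	firstAvailableOfChosenIndex.Observe(float64(chosenIndex))
}

func main() {
	recordChoice(0) // most preferred alternative satisfied
	recordChoice(2) // fell back to the third alternative
}
```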
This feature depends on the DRA structured parameters feature being enabled, and on DRA drivers being deployed. There are no requirements beyond those already needed for DRA structured parameters.
No.
No, just a new field on the `ResourceClaim.DeviceRequest` struct.
No.
Yes. When using this field, the user will add additional data in their `ResourceClaim` and `ResourceClaimTemplate` objects. This is an incremental increase on top of the existing structures. The number of alternative requests is limited to 8 in order to minimize the potential object size.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Scheduling a claim that uses this feature may take a bit longer, if it is necessary to go deeper into the list of alternative options before finding a suitable device. We can measure this impact in alpha.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
1.32 Enhancements Freeze - KEP merged, alpha implementation initiated
This adds complexity to the scheduler and to the cluster autoscaler, which will simulate the satisfaction of claims with different node shapes.
Rather than embedding a list of alternative request objects, we could use an indirection at either the `ResourceClaim` level or the `DeviceClaim` level. For example, we could create a new resource claim type by adding a `FirstOfDevices` list to the `ResourceClaimSpec`, and making it a one-of with `Devices`. Something like this:
```go
// ResourceClaimSpec defines what is being requested in a ResourceClaim and how to configure it.
type ResourceClaimSpec struct {
	// Devices defines how to request devices.
	//
	// oneOf: claimType
	// +optional
	Devices DeviceClaim

	// FirstOfDevices defines alternative sets of devices to claim, in priority order.
	//
	// oneOf: claimType
	// +optional
	FirstOfDevices []DeviceClaim

	...
}
```
This is arguably simpler and allows the alternatives to be essentially complete, alternate claims. It would be more difficult for the user, though, as it would require duplication of the other device requests. Additionally, if there were multiple separate `FirstAvailableOf` requests in a claim, the user would have to spell out all the combinations in order to get the same flexibility; for example, two independent requests with three alternatives each would require nine `DeviceClaim` entries.