DRA API evolution #14
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here. The pull request process is described here.
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples have been updated yet.
`go test ./dra-evolution/testdata` ensures that all YAML files in that directory pass strict parsing.
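For reference, a minimal sketch of what such a strict-parsing test can look like. The real test decodes into the actual `pkg/api` types; the stand-in type and the `*.yaml` glob below are assumptions for illustration only:

```go
package testdata

import (
	"os"
	"path/filepath"
	"testing"

	"sigs.k8s.io/yaml"
)

// claimLike is a stand-in for the real pkg/api types, used only to illustrate
// the mechanism; the actual test decodes into the generated API structs.
type claimLike struct {
	APIVersion string                 `json:"apiVersion"`
	Kind       string                 `json:"kind"`
	Metadata   map[string]interface{} `json:"metadata,omitempty"`
	Spec       map[string]interface{} `json:"spec,omitempty"`
}

func TestStrictParsing(t *testing.T) {
	files, err := filepath.Glob("*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	for _, file := range files {
		data, err := os.ReadFile(file)
		if err != nil {
			t.Fatal(err)
		}
		var obj claimLike
		// UnmarshalStrict rejects unknown fields, which is what catches
		// typos and outdated fields in the YAML examples.
		if err := yaml.UnmarshalStrict(data, &obj); err != nil {
			t.Errorf("%s: %v", file, err)
		}
	}
}
```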
Signed-off-by: Kevin Klues <[email protected]>
This imports the CEL implementation from DRA in 1.30, updates it to use the "device" variable name and to support fully qualified attribute names ("model" and "model.dra.example.com"). Then the parsed objects get validated, including compilation of the CEL expressions. This cannot cover runtime errors. The type-safe approach uncovers errors like comparing a quantity to a string, which is not supported in Kubernetes because it makes it impossible to do a cost analysis of an expression.
Instead of different types and different field names in class ("devices") and claim ("device"), it's now "device" everywhere. The YAML still looks reasonable and the code becomes simpler. There's also no need to duplicate the explanation of "CEL selector".
This makes the intent behind "check for specific driver" in a CEL expression more obvious.
Without these I was being forced to use uppercase names for the fields in YAML.

Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
`"foo" in device.string` looked odd. `"foo" in device.attributes` is more natural. It is now also possible to do `device.attributes["foo"]` but not recommended.
Signed-off-by: Kevin Klues <[email protected]>
Is there more written about how quota will actually work?
For example:
If I write a claim which my quota does not allow (e.g. I ask for 50 GPUs and my quota is 5), does the claim fail to create or fail to schedule? Will the autoscaler know not to provision machines with 50 GPUs?
If I write a OneOf claim, where one of the 2 options would be allowed under my quota, and the other would not - is that allowed?
What if the autoscaler was instructed to prefer creating machines with the variety that is over quota?
This is a whole design problem of its own.
type QuotaSpec struct {
// AllowManagementAccess controls whether claims with ManagementAccess
// may be allocated. The default if unset is to deny such access.
// +optional
Add: If multiple quota objects exist and at least one has a true value, access will be allowed.
Should I also turn the pointer into a value?
Otherwise we can have one object which explicitly says "true" and another which says "false". When it's a value, that second object isn't possible because the "false" gets omitted.
I made it a value.
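For illustration, a small hedged sketch of the resulting semantics. Only `QuotaSpec` and `AllowManagementAccess` come from the snippet above; the helper and its use are hypothetical:

```go
package quota

// QuotaSpec matches the snippet above: AllowManagementAccess is now a value,
// so "false" is the zero value and cannot be stated explicitly.
type QuotaSpec struct {
	// +optional
	AllowManagementAccess bool `json:"allowManagementAccess,omitempty"`
}

// managementAccessAllowed is a hypothetical helper: with multiple Quota
// objects in a namespace, access is allowed if at least one of them says so.
func managementAccessAllowed(quotas []QuotaSpec) bool {
	for _, q := range quotas {
		if q.AllowManagementAccess {
			return true
		}
	}
	return false
}
```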
With the assumption that all allocated devices must be local to the same node, the v1.NodeSelector can be replaced with a node name and the individual results don't need it.
//
// Consumers should be prepared to handle situations where the same device is
// listed in different pools, for example because the producer already added it
// to a new pool before removing it from an old one. Should this occurr, then
s/occurr/occur/
non-blocking nit
Fixed.
//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
Let's just put it on the to do list and not hold the PR.
// - maximum sum of a certain quantity attribute
//
// These are additional requirements when checking whether a device
// instance can satisfy a request. Creating a claim is always allowed,
I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.
The problem is, if there are multiple ways to satisfy a claim, and quota is granular on a CEL-expression level, I don't think we can determine at admission time if the claim will violate quota. For example, if quota is `count(device.model == "H100") < 10`, and we have already consumed 9 H100s in this namespace, and we have a claim that can be satisfied by A100 OR H100, then in theory the claim is satisfiable via A100s.
But this implies then that all schedulers and autoscalers become quota enforcement points, which I think is an expensive and risky proposition.
Either we need to simplify the claims or the quota system, I think.
To make it work at admission time, we need to limit the expressiveness of quota to things that are semantically understood by the platform, AND that can be determined up front from the claim at admission time. Right now that just means:
- management access
- total count of devices
If we make driver required in the claim, rather than a CEL condition, then we can limit usage by driver. But we still can't by model.
Once allocatable device resources are part of the model, they can be used for quota.
We can't limit by model because it is part of the CEL expression, and I don't think we want to put something like that directly into the claim.
If we make class a required part of claim, then we could limit per class as well. I think that may be our best bet to make truly expressive quota that can be enforced at admission time. We should allow the admin to define the allow list for classes, plus numerical limits, and require class in the claim. That would give us the most flexibility here.
I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.
Why? Better in which scenario?
It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.
There may be some cases where a single claim is defined such that it can never be satisfied (asks for 100 devices, only 50 allowed in the namespace). While it would be nice to give users more immediate feedback when they do this, I don't think that it justifies replicating the allocation checks in the admission phase.
Management access is a bit different. Assuming that the namespace settings don't change, asking for it in a namespace where it's not allowed is a permanent failure. We could add an admission check for this.
I think this is again a "UX vs flexibility" question. You want a simple UX, but that can only work by not supporting more complex cases (over quota at admission time, under quota at allocation time). I prefer to start with a flexible approach and then add admission checks where they make sense later (management access).
It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.
I don't think we should conflate "quota" - what is administratively allowed in this namespace - with "obtainability" - what devices are available for use. Quota for devices should work like other quota, and be applied at admission time. Whether it's memory or devices, as a workload author, I should be notified at admission time that I am asking for more than allowed.
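To make the admission-time option concrete, here is a purely hypothetical sketch of the kind of check that becomes possible if quota is limited to properties knowable from the claim itself (total device count and, if class became required, per-class counts). All type and field names are invented for illustration and are not part of this proposal:

```go
package admission

import "fmt"

// classQuota and deviceRequest are invented stand-ins: they capture only what
// is knowable from a claim at admission time (how many devices are requested
// and which class each request references).
type classQuota struct {
	maxDevices         int            // total devices allowed in the namespace
	maxDevicesPerClass map[string]int // optional per-DeviceClass limits
}

type deviceRequest struct {
	class string // DeviceClass referenced by the request
	count int    // number of devices requested
}

// admitClaim rejects a claim that would exceed the quota, based purely on the
// claim's requests plus usage already admitted in the namespace.
func admitClaim(requests []deviceRequest, usedTotal int, usedPerClass map[string]int, quota classQuota) error {
	total := usedTotal
	perClass := map[string]int{}
	for class, n := range usedPerClass {
		perClass[class] = n
	}
	for _, r := range requests {
		total += r.count
		perClass[r.class] += r.count
	}
	if total > quota.maxDevices {
		return fmt.Errorf("claim exceeds namespace device quota (%d > %d)", total, quota.maxDevices)
	}
	for class, n := range perClass {
		if limit, ok := quota.maxDevicesPerClass[class]; ok && n > limit {
			return fmt.Errorf("claim exceeds quota for class %q (%d > %d)", class, n, limit)
		}
	}
	return nil
}
```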
// Must be a DNS label.
//
// +optional
Name string `json:"name" protobuf:"bytes,1,name=name"`
TODO: make this required because we want it to be available for debugging.
/lgtm
Now that we've sort of reduced it to a pretty tight core, I'm thinking hard about future-safety and considering those places where you had the list-of-one-of pattern. This is something we can deal with post-KEP during implementation. I want to be safe, but I also don't want to over-complicate things in unexplored ways.
This review pass is mostly focused on this topic.
I got as far as Claim and have to step away. Sending comments so I do not forget.
// are local to that node.
//
// This is currently required, but this might get relaxed in the future.
NodeName string `json:"nodeName"`
Candidate for list-of-one-of pattern because we might add a NodeSelector or just make some resources available from anywhere. IIRC the reason not to just use a node selector is because we want on-node drivers to be authorized only for their own node's pools, right? I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and, if it exactly matches `matchFields: metadata.name` AND driverName matches the credential, then allow?
Assuming that's naive, I will analyze this anyway :)
We know that, in general, control-planes are updated before nodes, and I presume that includes drivers, but on-node drivers are exactly the ones who WON'T need a new option here.
Suppose we have a driver for some rack-switched thing (e.g. a PCIe switch). We install it before updating the cluster. It wants to set `nodeSelector: example.com/rack == r93` on this pool, but the apiserver isn't updated. We'll discard that information, and the apiserver will barf because `nodeName` was not set. Good.
The other direction is the risk - a pool is served which sets NodeSelector and not NodeName.
It seems like a safe assumption that kube-scheduler gets updated in conjunction with the control plane, so I am not super worried about that. But we know we have schedulers and autoscalers out of the core of k8s. If there were a back-rev scheduler which saw such an object, it would know that something is wrong because NodeName was not populated. Good.
As long as we never change `nodeName: ""` to mean "all nodes", it is not ambiguous. So how would a driver indicate all nodes? `nodeSelector: NotIn []` or something, I guess? Or we could make this `*string` and say `""` means "all". Or we could add a `Scope: Node | Nodes | Cluster` field.
So maybe we don't need special handling here.
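For concreteness, one possible shape of such a one-of; field and type names are purely illustrative, not part of the proposal. Exactly one field is set, so a back-rev client that sees none of the fields it knows knows it is missing information rather than silently assuming "all nodes":

```go
package api

import (
	v1 "k8s.io/api/core/v1"
)

// PoolAvailability is a hypothetical one-of describing where the devices of a
// pool can be used. Exactly one field is set.
type PoolAvailability struct {
	// NodeName: the devices are local to this node.
	NodeName string `json:"nodeName,omitempty"`
	// NodeSelector: the devices are reachable from the matching nodes,
	// e.g. a rack-switched PCIe enclosure.
	NodeSelector *v1.NodeSelector `json:"nodeSelector,omitempty"`
	// AllNodes: the devices are available from anywhere in the cluster.
	AllNodes bool `json:"allNodes,omitempty"`
}
```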
I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and if it exactly matches matchFields: metadata.name AND driverName matches the credential, then allow?
I doubt that this would work. Keeping the explicit field is definitely simpler.
So maybe we don't need special handling here.
I came to the same conclusion. The only slightly icky part is that right now, `NodeName` will never be empty and the apiserver will validate that. But at the same time we know that this may change, so current clients should not rely on that. That's the reason for the "might get relaxed in the future" comment. What they can do is `pool.spec.nodeName == node.name`. This will continue to work, even if the field becomes empty at some point.
A one-of struct would send a stronger signal to client developers that they have to be prepared for some future change. But that's not needed here.
This will continue to work, even if the field becomes empty at some point.
"Work" in the sense that a scheduler is not doing something wrong. To allocate a network-attached device, an old scheduler has to be extended to also consider a future node selector.
I think we agree - an inline one-of is sufficiently future safe. If you REALLY want to make it a named struct, OK, but only if it actually adds readability.
e.g.

spec:
  nodeName: foobar

vs

spec:
  availableFrom:
    nodeName: foobar
// field "String" and the corresponding method. That method is required. | ||
// The Kubernetes API is defined without that suffix to keep it more natural. | ||
|
||
// QuantityValue is a quantity. |
Candidate for list-of-one-of pattern because we might add more types.
Suppose we add `FloatValue`.
A forward-rev driver which tries to set a float will fail at the (back-rev) API because none of the known fields are set. Good.
A back-rev scheduler will know something is wrong because none of the known fields are set. Good.
So this seems OK?
I agree. An embedded one-of vs. a struct with some shared fields (the name in this case) and some one-of fields (the value fields here) have the same semantics, as long as clients are aware.
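A tiny hedged sketch of the client-side consequence (the type here is a stand-in and the helper is hypothetical; the pointer fields follow the snippet below): a consumer that only knows the current value types checks whether any known field is set and otherwise treats the attribute as coming from a newer API it does not understand.

```go
package api

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// DeviceAttributeValue is a stand-in for the one-of shown in the snippet
// below; a future FloatValue would be an additional pointer field here.
type DeviceAttributeValue struct {
	QuantityValue *resource.Quantity
	BoolValue     *bool
	StringValue   *string
}

// known reports whether the value uses a type this consumer understands.
// Back-rev consumers see a newer arm (e.g. FloatValue) as "no known field
// set" and can skip or reject the attribute instead of misinterpreting it.
func (v DeviceAttributeValue) known() bool {
	return v.QuantityValue != nil || v.BoolValue != nil || v.StringValue != nil
}
```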
QuantityValue *resource.Quantity `json:"quantity,omitempty" protobuf:"bytes,2,opt,name=quantity"`
// BoolValue is a true/false value.
BoolValue *bool `json:"bool,omitempty" protobuf:"bytes,3,opt,name=bool"`
// StringValue is a string.
need to set a max length on this too :)
64?
SGTM
//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
I am skeptical that there will be another form of criteria, and doubly skeptical that we would want to allow BOTH at the same time. So if this became `constraints string` or even `[]string`, I'd be super happy.
// separately, then they could be consistent within claims, but
// inconsistent across claims. Therefore, we need this additional
// constraint.
// Constraint must have one and only one field set.
For posterity: This is a proper use of list-of-one-of. We have potentially more than 1 criterion at the same time, and this will signal to back-rev components that there is something here they don't understand.
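As an illustration of that pattern, a hedged sketch of what a constraint entry could look like; the field name and shape are assumptions based on the MatchAttributes feature mentioned in this PR, not the actual proposed type:

```go
package api

// Constraint must have one and only one field set. A back-rev scheduler that
// finds an entry with none of the fields it knows set must refuse to
// allocate, because it cannot honor a constraint it does not understand.
type Constraint struct {
	// MatchAttributes requires that all devices allocated for the claim
	// report the same value for each of the listed attributes.
	// +optional
	MatchAttributes []string `json:"matchAttributes,omitempty"`

	// Future constraint kinds are added as additional one-of fields.
}
```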
// This is the node which provides the allocated devices.
//
// If empty, then the devices are not local to one particular node.
NodeName string `json:"nodeName,omitempty" protobuf:"bytes,2,opt,name=nodeName"`
Don't we want this per-device? Or more flexibly, don't we want the same "..or a node selector" as with a device pool?
For example, if I claim one device which is rack-switched and one which is row-switched (multiple racks), I might have one device that says `rack=rk93` and another device that says `row=rw76`, and the resulting node selector, for future users of this claim, is the AND of those?
I had the same question here: #26 (comment)
Let's move the discussion there and off of this merged PR...
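For the record, ANDing per-device selectors is mechanically straightforward. A hedged sketch (the helper name is invented; the semantics assume `v1.NodeSelector`, where terms are ORed and requirements within a term are ANDed):

```go
package allocator

import (
	v1 "k8s.io/api/core/v1"
)

// andNodeSelectors returns a selector that matches only nodes matched by both
// a and b. Because the terms of a NodeSelector are ORed while requirements
// within a term are ANDed, the AND of two selectors is the cross product of
// their terms with the requirements concatenated.
func andNodeSelectors(a, b *v1.NodeSelector) *v1.NodeSelector {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	merged := &v1.NodeSelector{}
	for _, ta := range a.NodeSelectorTerms {
		for _, tb := range b.NodeSelectorTerms {
			merged.NodeSelectorTerms = append(merged.NodeSelectorTerms, v1.NodeSelectorTerm{
				MatchExpressions: append(append([]v1.NodeSelectorRequirement{}, ta.MatchExpressions...), tb.MatchExpressions...),
				MatchFields:      append(append([]v1.NodeSelectorRequirement{}, ta.MatchFields...), tb.MatchFields...),
			})
		}
	}
	return merged
}
```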
This is the result of the prototyping in https://github.com/kubernetes-sigs/wg-device-management/tree/main/dra-evolution. The key changes compared to 1.30 are:
- Removal of CRD support. Vendor configuration parameters are still supported as raw extensions of the in-tree types.
- Removal of the notion of a "device model" and a selection mechanism that only works for one such model. This simplifies the API.
- Instead of one claim referencing a ResourceClass, individual requests inside a claim now reference a DeviceClass.
- Support for "match attributes".
- Support for "management access".
- Support for selecting devices from a request in a claim for a container instead of exposing all devices of a claim.

Editorial changes:
- Reduced usage of the word "resource" in favor of "devices" and "DRA drivers".
- Removal of some out-dated text.

This revision of the KEP roughly corresponds to https://github.com/kubernetes-sigs/wg-device-management/tree/258d109f54d3baaa48719519dec37a450a1dd5a1/dra-evolution (= kubernetes-sigs/wg-device-management#14 with some minor edits).

What is missing:
- renaming of ResourceSlice -> ResourcePool (https://github.com/kubernetes-sigs/wg-device-management/pull/28/files#r1630964004)

What was added from kubernetes-sigs/wg-device-management#26:
- ResourceClaim.Status.Allocation.NodeName -> NodeSelector
- flattening of the allocation result

Undecided:
- kubernetes-sigs/wg-device-management#26
- kubernetes-sigs/wg-device-management#16
- kubernetes-sigs/wg-device-management#18
This creates a "dra-evolution" directory with an alternative to the "k8srm-prototype". Whereas"k8srm-prototype" started with a blank slate, "dra-evolution" started with the 1.30 API and got updated to make it more like "k8srm-prototype".
It includes respectively supersedes the following incremental KEP updates:
The top-level README.md has a comparison table with a rationale why certain changes were made (or not made).
It's now possible to do `go test ./testdata`, which checks that the YAML files and the embedded CEL expressions match the types and CEL syntax. `cmd` and `pkg/gen` have not been updated yet. `pkg/api` is complete and now also includes the pod spec changes. But please don't review yet. @pohly and @klueska are still in the process of converting the YAML testdata files, which may lead to changes in the API.