DRA API evolution #14
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here. The pull request process is described here.
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples have been updated yet.
`go test ./dra-evolution/testdata` ensures that all YAML files in that directory pass strict parsing.
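For reference, a minimal sketch of what such a strict-parsing test can look like. The real test decodes into the actual `pkg/api` types; the stand-in type and the `*.yaml` glob below are assumptions for illustration only:

```go
package testdata

import (
	"os"
	"path/filepath"
	"testing"

	"sigs.k8s.io/yaml"
)

// claimLike is a stand-in for the real pkg/api types, used only to illustrate
// the mechanism; the actual test decodes into the generated API structs.
type claimLike struct {
	APIVersion string                 `json:"apiVersion"`
	Kind       string                 `json:"kind"`
	Metadata   map[string]interface{} `json:"metadata,omitempty"`
	Spec       map[string]interface{} `json:"spec,omitempty"`
}

func TestStrictParsing(t *testing.T) {
	files, err := filepath.Glob("*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	for _, file := range files {
		data, err := os.ReadFile(file)
		if err != nil {
			t.Fatal(err)
		}
		var obj claimLike
		// UnmarshalStrict rejects unknown fields, which is what catches
		// typos and outdated fields in the YAML examples.
		if err := yaml.UnmarshalStrict(data, &obj); err != nil {
			t.Errorf("%s: %v", file, err)
		}
	}
}
```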
Signed-off-by: Kevin Klues <[email protected]>
This imports the CEL implementation from DRA in 1.30, updates it to use the "device" variable name and to support fully qualified attribute names ("model" and "model.dra.example.com"). Then the parsed objects get validated, including compilation of the CEL expressions. This cannot cover runtime errors. The type-safe approach uncovers errors like comparing a quantity to a string, which is not supported in Kubernetes because it makes it impossible to do a cost analysis of an expression.
Instead of different types and different field names in class ("devices") and claim ("device"), it's now "device" everywhere. The YAML still looks reasonable and the code becomes simpler. There's also no need to duplicate the explanation of "CEL selector".
This makes the intent behind "check for specific driver" in a CEL expression more obvious.
Without these I was being forced to use uppercase names for the fields in YAML.

Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
`"foo" in device.string` looked odd. `"foo" in device.attributes` is more natural. It is now also possible to do `device.attributes["foo"]` but not recommended.
Signed-off-by: Kevin Klues <[email protected]>
Is there more written about how quota will actually work?
For example:
If I write a claim which my quota does not allow (e.g. I ask for 50 GPUs and my quota is 5), does the claim fail to create or fail to schedule? Will the autoscaler know not to provision machines with 50 GPUs?
If I write a OneOf claim, where one of the 2 options would be allowed under my quota, and the other would not - is that allowed?
What if the autoscaler was instructed to prefer creating machines with the variety that is over quota?
This is a whole design problem of its own.
type QuotaSpec struct {
// AllowManagementAccess controls whether claims with ManagementAccess
// may be allocated. The default if unset is to deny such access.
// +optional
Add: If multiple quota objects exist and at least one has a true value, access will be allowed.
Should I also turn the pointer into a value?
Otherwise we can have one object which explicitly says "true" and another which says "false". When it's a value, that second object isn't possible because the "false" gets omitted.
I made it a value.
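For illustration, a small hedged sketch of the resulting semantics. Only `QuotaSpec` and `AllowManagementAccess` come from the snippet above; the helper and its use are hypothetical:

```go
package quota

// QuotaSpec matches the snippet above: AllowManagementAccess is now a value,
// so "false" is the zero value and cannot be stated explicitly.
type QuotaSpec struct {
	// +optional
	AllowManagementAccess bool `json:"allowManagementAccess,omitempty"`
}

// managementAccessAllowed is a hypothetical helper: with multiple Quota
// objects in a namespace, access is allowed if at least one of them says so.
func managementAccessAllowed(quotas []QuotaSpec) bool {
	for _, q := range quotas {
		if q.AllowManagementAccess {
			return true
		}
	}
	return false
}
```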
With the assumption that all allocated devices must be local to the same node, the v1.NodeSelector can be replaced with a node name and the individual results don't need it.
//
// Consumers should be prepared to handle situations where the same device is
// listed in different pools, for example because the producer already added it
// to a new pool before removing it from an old one. Should this occurr, then
s/occurr/occur/
non-blocking nit
Fixed.
//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
Let's just put it on the to do list and not hold the PR.
// - maximum sum of a certain quantity attribute
//
// These are additional requirements when checking whether a device
// instance can satisfy a request. Creating a claim is always allowed,
I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.
The problem is, if there are multiple ways to satisfy a claim, and quota is granular on a CEL-expression level, I don't think we can determine at admission time if the claim will violate quota. For example, if quota is `count(device.model == "H100") < 10`, and we have already consumed 9 H100s in this namespace, and we have a claim that can be satisfied by A100 OR H100, then in theory the claim is satisfiable via A100s.
But this implies then that all schedulers and autoscalers become quota enforcement points, which I think is an expensive and risky proposition.
Either we need to simplify the claims or the quota system, I think.
To make it work at admission time, we need to limit the expressiveness of quota to things that are semantically understood by the platform, AND that can be determined up front from the claim at admission time. Right now that just means:
- management access
- total count of devices
If we make driver required in the claim, rather than a CEL condition, then we can limit usage by driver. But we still can't by model.
Once allocatable device resources are part of the model, they can be used for quota.
We can't limit by model because it is part of the CEL expression, and I don't think we want to put something like that directly into the claim.
If we make class a required part of claim, then we could limit per class as well. I think that may be our best bet to make truly expressive quota that can be enforced at admission time. We should allow the admin to define the allow list for classes, plus numerical limits, and require class in the claim. That would give us the most flexibility here.
I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.
Why? Better in which scenario?
It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.
There may be some cases where a single claim is defined such that it can never be satisfied (asks for 100 devices, only 50 allowed in the namespace). While it would be nice to give users more immediate feedback when they do this, I don't think that it justifies replicating the allocation checks in the admission phase.
Management access is a bit different. Assuming that the namespace settings don't change, asking for it in a namespace where it's not allowed is a permanent failure. We could add an admission check for this.
I think this is again a "UX vs flexibility" question. You want a simple UX, but that can only work by not supporting more complex cases (over quota at admission time, under quota at allocation time). I prefer to start with a flexible approach and then add admission checks where they make sense later (management access).
It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.
I don't think we should conflate "quota" - what is administratively allowed in this namespace - with "obtainability" - what devices are available for use. Quota for devices should work like other quota, and be applied at admission time. Whether it's memory or devices, as a workload author, I should be notified at admission time that I am asking for more than allowed.
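To make the admission-time option concrete, here is a purely hypothetical sketch of the kind of check that becomes possible if quota is limited to properties knowable from the claim itself (total device count and, if class became required, per-class counts). All type and field names are invented for illustration and are not part of this proposal:

```go
package admission

import "fmt"

// classQuota and deviceRequest are invented stand-ins: they capture only what
// is knowable from a claim at admission time (how many devices are requested
// and which class each request references).
type classQuota struct {
	maxDevices         int            // total devices allowed in the namespace
	maxDevicesPerClass map[string]int // optional per-DeviceClass limits
}

type deviceRequest struct {
	class string // DeviceClass referenced by the request
	count int    // number of devices requested
}

// admitClaim rejects a claim that would exceed the quota, based purely on the
// claim's requests plus usage already admitted in the namespace.
func admitClaim(requests []deviceRequest, usedTotal int, usedPerClass map[string]int, quota classQuota) error {
	total := usedTotal
	perClass := map[string]int{}
	for class, n := range usedPerClass {
		perClass[class] = n
	}
	for _, r := range requests {
		total += r.count
		perClass[r.class] += r.count
	}
	if total > quota.maxDevices {
		return fmt.Errorf("claim exceeds namespace device quota (%d > %d)", total, quota.maxDevices)
	}
	for class, n := range perClass {
		if limit, ok := quota.maxDevicesPerClass[class]; ok && n > limit {
			return fmt.Errorf("claim exceeds quota for class %q (%d > %d)", class, n, limit)
		}
	}
	return nil
}
```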
// Must be a DNS label.
//
// +optional
Name string `json:"name" protobuf:"bytes,1,name=name"`
TODO: make this required because we want it to be available for debugging.
/lgtm
Now that we've sort of reduced it to a pretty tight core, I'm thinking hard about future-safety and considering those places where you had the list-of-one-of pattern. This is something we can deal with post-KEP during implementation. I want to be safe, but I also don't want to over-complicate things in unexplored ways.
This review pass is mostly focused on this topic.
I got as far as Claim and have to step away. Sending comments so I do not forget.
// are local to that node.
//
// This is currently required, but this might get relaxed in the future.
NodeName string `json:"nodeName"`
Candidate for list-of-one-of pattern because we might add a NodeSelector or just make some resources available from anywhere. IIRC the reason not to just use a node selector is because we want on-node drivers to be authorized only for their own node's pools, right? I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and, if it exactly matches `matchFields: metadata.name` AND driverName matches the credential, then allow?
Assuming that's naive, I will analyze this anyway :)
We know that, in general, control-planes are updated before nodes, and I presume that includes drivers, but on-node drivers are exactly the ones who WON'T need a new option here.
Suppose we have a driver for some rack-switched thing (e.g. a PCIe switch). We install it before updating the cluster. It wants to set `nodeSelector: example.com/rack == r93` on this pool, but the apiserver isn't updated. We'll discard that information, and the apiserver will barf because `nodeName` was not set. Good.
The other direction is the risk - a pool is served which sets NodeSelector and not NodeName.
It seems like a safe assumption that kube-scheduler gets updated in conjunction with the control plane, so I am not super worried about that. But we know we have schedulers and autoscalers out of the core of k8s. If there were a back-rev scheduler which saw such an object, it would know that something is wrong because NodeName was not populated. Good.
As long as we never change `nodeName: ""` to mean "all nodes", it is not ambiguous. So how would a driver indicate all nodes? `nodeSelector: NotIn []` or something, I guess? Or we could make this `*string` and say `""` means "all". Or we could add a `Scope: Node | Nodes | Cluster` field.
So maybe we don't need special handling here.
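For concreteness, one possible shape of such a one-of; field and type names are purely illustrative, not part of the proposal. Exactly one field is set, so a back-rev client that sees none of the fields it knows knows it is missing information rather than silently assuming "all nodes":

```go
package api

import (
	v1 "k8s.io/api/core/v1"
)

// PoolAvailability is a hypothetical one-of describing where the devices of a
// pool can be used. Exactly one field is set.
type PoolAvailability struct {
	// NodeName: the devices are local to this node.
	NodeName string `json:"nodeName,omitempty"`
	// NodeSelector: the devices are reachable from the matching nodes,
	// e.g. a rack-switched PCIe enclosure.
	NodeSelector *v1.NodeSelector `json:"nodeSelector,omitempty"`
	// AllNodes: the devices are available from anywhere in the cluster.
	AllNodes bool `json:"allNodes,omitempty"`
}
```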
I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and if it exactly matches matchFields: metadata.name AND driverName matches the credential, then allow?
I doubt that this would work. Keeping the explicit field is definitely simpler.
So maybe we don't need special handling here.
I came to the same conclusion. The only slightly icky part is that right now, `NodeName` will never be empty and the apiserver will validate that. But at the same time we know that this may change, so current clients should not rely on that. That's the reason for the "might get relaxed in the future" comment. What they can do is `pool.spec.nodeName == node.name`. This will continue to work, even if the field becomes empty at some point.
A one-of struct would send a stronger signal to client developers that they have to be prepared for some future change. But that's not needed here.
This will continue to work, even if the field becomes empty at some point.
"Work" in the sense that a scheduler is not doing something wrong. To allocate a network-attached device, an old scheduler has to be extended to also consider a future node selector.
I think we agree - an inline one-of is sufficiently future safe. If you REALLY want to make it a named struct, OK, but only if it actually adds readability.
e.g.

spec:
  nodeName: foobar

vs

spec:
  availableFrom:
    nodeName: foobar
// field "String" and the corresponding method. That method is required. | ||
// The Kubernetes API is defined without that suffix to keep it more natural. | ||
|
||
// QuantityValue is a quantity. |
Candidate for list-of-one-of pattern because we might add more types.
Suppose we add `FloatValue`.
A forward-rev driver which tries to set a float will fail at the (back-rev) API because none of the known fields are set. Good.
A back-rev scheduler will know something is wrong because none of the known fields are set. Good.
So this seems OK?
I agree. An embedded one-of vs. a struct with some shared fields (the name in this case) and some one-of fields (the value fields here) have the same semantics, as long as clients are aware.
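A tiny hedged sketch of the client-side consequence (the type here is a stand-in and the helper is hypothetical; the pointer fields follow the snippet below): a consumer that only knows the current value types checks whether any known field is set and otherwise treats the attribute as coming from a newer API it does not understand.

```go
package api

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// DeviceAttributeValue is a stand-in for the one-of shown in the snippet
// below; a future FloatValue would be an additional pointer field here.
type DeviceAttributeValue struct {
	QuantityValue *resource.Quantity
	BoolValue     *bool
	StringValue   *string
}

// known reports whether the value uses a type this consumer understands.
// Back-rev consumers see a newer arm (e.g. FloatValue) as "no known field
// set" and can skip or reject the attribute instead of misinterpreting it.
func (v DeviceAttributeValue) known() bool {
	return v.QuantityValue != nil || v.BoolValue != nil || v.StringValue != nil
}
```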
QuantityValue *resource.Quantity `json:"quantity,omitempty" protobuf:"bytes,2,opt,name=quantity"`
// BoolValue is a true/false value.
BoolValue *bool `json:"bool,omitempty" protobuf:"bytes,3,opt,name=bool"`
// StringValue is a string.
need to set a max length on this too :)
64?
SGTM
//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
I am skeptical that there will be another form of criteria, and doubly skeptical that we would want to allow BOTH at the same time. So if this became `constraints string` or even `[]string`, I'd be super happy.
// separately, then they could be consistent within claims, but
// inconsistent across claims. Therefore, we need this additional
// constraint.
// Constraint must have one and only one field set.
For posterity: This is a proper use of list-of-one-of. We have potentially more than 1 criterion at the same time, and this will signal to back-rev components that there is something here they don't understand.
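As an illustration of that pattern, a hedged sketch of what a constraint entry could look like; the field name and shape are assumptions based on the MatchAttributes feature mentioned in this PR, not the actual proposed type:

```go
package api

// Constraint must have one and only one field set. A back-rev scheduler that
// finds an entry with none of the fields it knows set must refuse to
// allocate, because it cannot honor a constraint it does not understand.
type Constraint struct {
	// MatchAttributes requires that all devices allocated for the claim
	// report the same value for each of the listed attributes.
	// +optional
	MatchAttributes []string `json:"matchAttributes,omitempty"`

	// Future constraint kinds are added as additional one-of fields.
}
```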
// This is the node which provides the allocated devices.
//
// If empty, then the devices are not local to one particular node.
NodeName string `json:"nodeName,omitempty" protobuf:"bytes,2,opt,name=nodeName"`
Don't we want this per-device? Or more flexibly, don't we want the same "..or a node selector" as with a device pool?
For example, if I claim one device which is rack-switched and one which is row-switched (multiple racks), I might have one device that says `rack=rk93` and another device that says `row=rw76`, and the resulting node selector, for future users of this claim, is the AND of those?
I had the same question here: #26 (comment)
Let's move the discussion there and off of this merged PR...
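For the record, ANDing per-device selectors is mechanically straightforward. A hedged sketch (the helper name is invented; the semantics assume `v1.NodeSelector`, where terms are ORed and requirements within a term are ANDed):

```go
package allocator

import (
	v1 "k8s.io/api/core/v1"
)

// andNodeSelectors returns a selector that matches only nodes matched by both
// a and b. Because the terms of a NodeSelector are ORed while requirements
// within a term are ANDed, the AND of two selectors is the cross product of
// their terms with the requirements concatenated.
func andNodeSelectors(a, b *v1.NodeSelector) *v1.NodeSelector {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	merged := &v1.NodeSelector{}
	for _, ta := range a.NodeSelectorTerms {
		for _, tb := range b.NodeSelectorTerms {
			merged.NodeSelectorTerms = append(merged.NodeSelectorTerms, v1.NodeSelectorTerm{
				MatchExpressions: append(append([]v1.NodeSelectorRequirement{}, ta.MatchExpressions...), tb.MatchExpressions...),
				MatchFields:      append(append([]v1.NodeSelectorRequirement{}, ta.MatchFields...), tb.MatchFields...),
			})
		}
	}
	return merged
}
```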
This is the result of the prototyping in https://github.com/kubernetes-sigs/wg-device-management/tree/main/dra-evolution. The key changes compared to 1.30 are:
- Removal of CRD support. Vendor configuration parameters are still supported as raw extensions of the in-tree types.
- Removal of the notion of a "device model" and a selection mechanism that only works for one such model. This simplifies the API.
- Instead of one claim referencing a ResourceClass, individual requests inside a claim now reference a DeviceClass.
- Support for "match attributes".
- Support for "management access".
- Support for selecting devices from a request in a claim for a container instead of exposing all devices of a claim.

Editorial changes:
- Reduced usage of the word "resource" in favor of "devices" and "DRA drivers".
- Removal of some out-dated text.

This revision of the KEP roughly corresponds to https://github.com/kubernetes-sigs/wg-device-management/tree/258d109f54d3baaa48719519dec37a450a1dd5a1/dra-evolution (= kubernetes-sigs/wg-device-management#14 with some minor edits).

What is missing:
- renaming of ResourceSlice -> ResourcePool (https://github.com/kubernetes-sigs/wg-device-management/pull/28/files#r1630964004)

What was added from kubernetes-sigs/wg-device-management#26:
- ResourceClaim.Status.Allocation.NodeName -> NodeSelector
- flattening of the allocation result

Undecided:
- kubernetes-sigs/wg-device-management#26
- kubernetes-sigs/wg-device-management#16
- kubernetes-sigs/wg-device-management#18
This creates a "dra-evolution" directory with an alternative to the "k8srm-prototype". Whereas"k8srm-prototype" started with a blank slate, "dra-evolution" started with the 1.30 API and got updated to make it more like "k8srm-prototype".
It includes respectively supersedes the following incremental KEP updates:
The top-level README.md has a comparison table with a rationale why certain changes were made (or not made).
It's now possible to do `go test ./testdata`, which checks that the YAML files and the embedded CEL expressions match the types and CEL syntax. `cmd` and `pkg/gen` have not been updated yet. `pkg/api` is complete and now also includes the pod spec changes. But please don't review yet. @pohly and @klueska are still in the process of converting the YAML testdata files, which may lead to changes in the API.