DRA API evolution #14

Merged · 71 commits · Jun 5, 2024
Conversation

@pohly (Contributor) commented May 14, 2024:

This creates a "dra-evolution" directory with an alternative to the "k8srm-prototype". Whereas "k8srm-prototype" started with a blank slate, "dra-evolution" started with the 1.30 API and got updated to make it more like "k8srm-prototype".

It includes and respectively supersedes the following incremental KEP updates:

The top-level README.md has a comparison table with a rationale for why certain changes were made (or not made).

It's now possible to do `go test ./testdata`, which checks that the YAML files and the embedded CEL expressions match the types and CEL syntax.

cmd and pkg/gen have not been updated yet. The pkg/api is complete and now also includes the pod spec changes. But please don't review yet. @pohly and @klueska are still in the process of converting the YAML testdata files, which may lead to changes in the API.
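
For readers who want to see what the `go test ./testdata` check amounts to, here is a minimal sketch of a strict-decoding test. The file layout, the `api.ResourceClaim` target type, and the import path are assumptions for illustration; the actual test in this PR may differ.

package testdata

import (
	"os"
	"path/filepath"
	"testing"

	"sigs.k8s.io/yaml"

	// The import path and target type for the API are assumptions.
	api "sigs.k8s.io/wg-device-management/dra-evolution/pkg/api"
)

// TestStrictParsing decodes every YAML file in this directory with unknown
// fields disallowed, so misspelled field names fail the test instead of
// being silently dropped.
func TestStrictParsing(t *testing.T) {
	files, err := filepath.Glob("*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	for _, file := range files {
		data, err := os.ReadFile(file)
		if err != nil {
			t.Fatalf("%s: %v", file, err)
		}
		// Decoding into a typed object is what makes the check strict:
		// sigs.k8s.io/yaml rejects fields that the type does not declare.
		var obj api.ResourceClaim
		if err := yaml.UnmarshalStrict(data, &obj); err != nil {
			t.Errorf("%s: %v", file, err)
		}
	}
}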

@k8s-ci-robot added the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress") on May 14, 2024.
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved, cncf-cla: yes, and size/XXL labels on May 14, 2024.
pohly and others added 22 commits May 16, 2024 19:39
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
`go test ./dra-evolution/testdata` ensures that all YAML files in that
directory pass strict parsing.
This imports the CEL implementation from DRA in 1.30, updates it to use the
"device" variable name and to support fully qualified attribute names ("model"
and "model.dra.example.com").

Then the parsed objects get validated, including compilation of the CEL
expressions. This cannot cover runtime errors. The type-safe approach uncovers
errors like comparing a quantity to a string, which is not supported in
Kubernetes because it makes it impossible to do a cost analysis of an
expression.
Instead of different types and different field names in class ("devices") and
claim ("device"), it's now "device" everywhere. The YAML still looks reasonable
and the code becomes simpler. There's also no need to duplicate the explanation
of "CEL selector".
This makes the intent behind "check for specific driver" in a CEL
expression more obvious.
Without these I was being forced to use uppercase names for the fields in YAML.

Signed-off-by: Kevin Klues <[email protected]>
`"foo" in device.string` looked odd. `"foo" in device.attributes` is more
natural. It is now also possible to do `device.attributes["foo"]` but not
recommended.
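
To make the CEL validation described in these commits more concrete, here is a minimal sketch of compiling a device selector with cel-go. The actual implementation imported from DRA in 1.30 declares richer types for `device`; the dynamic map used here is a simplification, and the expression is only an example.

package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare a "device" variable, as the selectors in this PR expect.
	// The real environment models attributes with typed values; a dynamic
	// map is enough to show compile-time checking of the expression.
	env, err := cel.NewEnv(
		cel.Variable("device", cel.MapType(cel.StringType, cel.DynType)),
	)
	if err != nil {
		panic(err)
	}

	// A selector that checks for a specific driver and for the presence of
	// an attribute, mirroring the examples in the commit messages above.
	expr := `device.driver == "dra.example.com" && "model" in device.attributes`
	if _, issues := env.Compile(expr); issues != nil && issues.Err() != nil {
		fmt.Println("compile error:", issues.Err())
		return
	}
	fmt.Println("selector compiles")
}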
@thockin left a comment:

Is there more written about how quota will actually work?

For example:

If I write a claim which my quota does not allow (e.g. I ask for 50 GPUs and my quota is 5), does the claim fail to create or fail to schedule? Will the autoscaler know not to provision machines with 50 GPUs?

If I write a OneOf claim, where one of the 2 options would be allowed under my quota, and the other would not - is that allowed?

What if the autoscaler was instructed to prefer creating machines with the variety that is over quota?

This is a whole design problem of its own.

type QuotaSpec struct {
// AllowManagementAccess controls whether claims with ManagementAccess
// may be allocated. The default if unset is to deny such access.
// +optional
Contributor:

Add: If multiple quota objects exist and at least one has a true value, access will be allowed.

Contributor (author):

Should I also turn the pointer into a value?

Otherwise we can have one object which explicitly says "true" and another which says "false". When it's a value, that second object isn't possible because the "false" gets omitted.

Contributor (author):

I made it a value.
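
A minimal sketch of the field after that change, assuming the JSON tag (the hunk above does not show it):

// AllowManagementAccess controls whether claims with ManagementAccess
// may be allocated. As a plain value, false is omitted from the YAML, so
// one Quota object cannot explicitly contradict another with "false".
// +optional
AllowManagementAccess bool `json:"allowManagementAccess,omitempty"`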

@kubernetes-sigs deleted a comment from thockin on Jun 4, 2024.
pohly added 3 commits June 4, 2024 17:44
With the assumption that all allocated devices must be local to the same node,
the v1.NodeSelector can be replaced with a node name and the individual results
don't need it.
//
// Consumers should be prepared to handle situations where the same device is
// listed in different pools, for example because the producer already added it
// to a new pool before removing it from an old one. Should this occurr, then
Contributor:

s/occurr/occur/

non-blocking nit

Contributor (author):

Fixed.

dra-evolution/pkg/api/capacity_types.go (resolved)
dra-evolution/pkg/api/capacity_types.go (resolved)
//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
Contributor:

Let's just put it on the to do list and not hold the PR.

dra-evolution/pkg/api/claim_types.go (resolved)
// - maximum sum of a certain quantity attribute
//
// These are additional requirements when checking whether a device
// instance can satisfy a request. Creating a claim is always allowed,
Contributor:

I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.

The problem is, if there are multiple ways to satisfy a claim, and quota is granular on a CEL-expression level, I don't think we can determine at admission time if the claim will violate quota. For example, if quota is count(device.model == "H100") < 10, and we have already consumed 9 H100s in this namespace, and we have claim that can be satisfied by A100 OR H100, then in theory the claim is satisfiable via A100s.

But this implies then that all schedulers and autoscalers become quota enforcement points, which I think is an expensive and risky proposition.

Either we need to simplify the claims or the quota system, I think.

Contributor:

To make it work at admission time, we need to limit the expressiveness of quota to things that are semantically understood by the platform, AND that can be determined up front from the claim at admission time. Right now that just means:

  • management access
  • total count of devices

If we make driver required in the claim, rather than a CEL condition, then we can limit usage by driver. But we still can't by model.

Once allocatable device resources are part of the model, they can be used for quota.

We can't limit by model because it is part of the CEL expression, and I don't think we want to put something like that directly into the claim.

If we make class a required part of the claim, then we could limit per class as well. I think that may be our best bet for making truly expressive quota that can be enforced at admission time. We should allow the admin to define the allow list for classes, plus numerical limits, and require class in the claim. That would give us the most flexibility here.
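
Purely to illustrate the shape being proposed here, a hypothetical admission-time quota type might look like the sketch below. Nothing like this exists in the PR; all names are made up.

// DeviceQuotaSpec is a hypothetical sketch of class-based, admission-time
// quota as discussed in this thread: an allow list of classes plus
// numerical limits, with the class required in the claim.
type DeviceQuotaSpec struct {
	// AllowedDeviceClasses lists the device classes that claims in this
	// namespace may reference. An empty list allows none.
	AllowedDeviceClasses []string `json:"allowedDeviceClasses,omitempty"`

	// MaxDevicesPerClass limits the total number of requested devices per
	// class, keyed by class name.
	MaxDevicesPerClass map[string]int64 `json:"maxDevicesPerClass,omitempty"`

	// AllowManagementAccess controls whether claims with ManagementAccess
	// may be created in this namespace.
	AllowManagementAccess bool `json:"allowManagementAccess,omitempty"`
}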

Contributor (author):

> I don't think it works well to have quota be asynchronous; it would work a lot better at admission time.

Why? Better in which scenario?

It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.

There may be some cases where a single claim is defined such that it can never be satisfied (asks for 100 devices, only 50 allowed in the namespace). While it would be nice to give users more immediate feedback when they do this, I don't think that it justifies replicating the allocation checks in the admission phase.

Management access is a bit different. Assuming that the namespace settings don't change, asking for it in a namespace where it's not allowed is a permanent failure. We could add an admission check for this.

I think this is again a "UX vs flexibility" question. You want a simple UX, but that can only work by not supporting more complex cases (over quota at admission time, under quota at allocation time). I prefer to start with a flexible approach and then add admission checks where they make sense later (management access).

Contributor:

> It's also worse in some scenarios. For example, suppose someone uses Kubernetes as a batch system. It's normal to create more jobs than currently are allowed to run, so their claims will be "over quota". That's fine, jobs will run as soon as previous jobs are completed.

I don't think we should conflate "quota" - what is administratively allowed in this namespace - with "obtainability" - what devices are available for use. Quota for devices should work like other quota, and be applied at admission time. Whether it's memory or devices, as a workload author, I should be notified at admission time that I am asking for more than allowed.

// Must be a DNS label.
//
// +optional
Name string `json:"name" protobuf:"bytes,1,name=name"`
Contributor (author):

TODO: make this required because we want it to be available for debugging.

@johnbelamaric (Contributor) commented:

/lgtm

@k8s-ci-robot added the lgtm label on Jun 5, 2024.
@k8s-ci-robot merged commit 05bd444 into kubernetes-sigs:main on Jun 5, 2024 (2 checks passed).
@thockin left a comment:

Now that we've sort of reduced it to a pretty tight core, I'm thinking hard about future-safety and considering those places where you had the list-of-one-of pattern. This is something we can deal with post-KEP during implementation. I want to be safe, but I also don't want to over-complicate things in unexplored ways.

This review pass is mostly focused on this topic.

I got as far as Claim and have to step away. Sending comments so I do not forget.

dra-evolution/pkg/api/capacity_types.go (resolved)
// are local to that node.
//
// This is currently required, but this might get relaxed in the future.
NodeName string `json:"nodeName"`
Comment:

Candidate for list-of-one-of pattern because we might add a NodeSelector or just make some resources available from anywhere. IIRC the reason not to just use a node selector is because we want on-node drivers to be authorized only for their own node's pools, right? I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and if it exactly matches matchFields: metadata.name AND driverName matches the credential, then allow?

Assuming that's naive, I will analyze this anyway :)

We know that, in general, control-planes are updated before nodes, and I presume that includes drivers, but on-node drivers are exactly the ones who WON'T need a new option here.

Suppose we have a driver for some rack-switched thing (e.g. a PCIe switch). We install it before updating the cluster. It wants to set nodeSelector: example.com/rack == r93 on this pool, but the apiserver isn't updated. We'll discard that information, and the apiserver will barf because nodeName was not set. Good.

The other direction is the risk - a pool is served which sets NodeSelector and not NodeName.

It seems like a safe assumption that kube-scheduler gets updated in conjunction with the control plane, so I am not super worried about that. But we know we have schedulers and autoscalers out of the core of k8s. If there were a back-rev scheduler which saw such an object, it would know that something is wrong because NodeName was not populated. Good.

As long as we never change nodeName: "" to mean "all nodes", it is not ambiguous. So how would a driver indicate all nodes? nodeSelector: NotIn [] or something, I guess? Or we could make this *string and say "" means "all". Or we could add a Scope: Node | Nodes | Cluster field.

So maybe we don't need special handling here.

Contributor (author):

> I'm not well versed in the authorizer subsystem, but wouldn't it be possible to look at a node selector and if it exactly matches matchFields: metadata.name AND driverName matches the credential, then allow?

I doubt that this would work. Keeping the explicit field is definitely simpler.

> So maybe we don't need special handling here.

I came to the same conclusion. The only slightly icky part is that right now, NodeName will never be empty and the apiserver will validate that. But at the same time we know that this may change, so current clients should not rely on that. That's the reason for the "might get relaxed in the future" comment. What they can do is pool.spec.nodeName == node.name. This will continue to work, even if the field becomes empty at some point.

A one-of struct would send a stronger signal to client developers that they have to be prepared for some future change. But that's not needed here.

Contributor (author):

> This will continue to work, even if the field becomes empty at some point.

"Work" in the sense that a scheduler is not doing something wrong. To allocate a network-attached device, an old scheduler has to be extended to also consider a future node selector.

Comment:

I think we agree - an inline one-of is sufficiently future safe. If you REALLY want to make it a named struct, OK, but only if it actually adds readability.

e.g.

spec:
     nodeName: foobar

vs

spec:
  availableFrom:
    nodeName: foobar

dra-evolution/pkg/api/capacity_types.go (resolved)
dra-evolution/pkg/api/capacity_types.go (resolved)
// field "String" and the corresponding method. That method is required.
// The Kubernetes API is defined without that suffix to keep it more natural.

// QuantityValue is a quantity.
Comment:

Candidate for list-of-one-of pattern because we might add more types

Suppose we add FloatValue.

A forward-rev driver which tries to set a float will fail at the (back-rev) API because none of the known fields are set. Good.

A back-rev scheduler will know something is wrong because none of the known fields are set. Good.

So this seems OK?

Contributor (author):

I agree. An embedded one-of vs. a "struct with some shared fields (the name here in this case) and some one-of fields (the value fields here)" have the same semantics, as long as clients are aware.

QuantityValue *resource.Quantity `json:"quantity,omitempty" protobuf:"bytes,2,opt,name=quantity"`
// BoolValue is a true/false value.
BoolValue *bool `json:"bool,omitempty" protobuf:"bytes,3,opt,name=bool"`
// StringValue is a string.
Comment:

need to set a max length on this too :)

Contributor (author):

64?

Comment:

SGTM
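
Putting this thread together, the one-of value struct quoted in the hunks above might end up looking roughly like the sketch below. The type name, field set, and tags are assumptions based on the quoted lines, with the 64-character limit agreed on here.

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// AttributeValue (name assumed) holds exactly one value; exactly one of
// the fields must be set. A client that finds none of the fields it knows
// set can conclude that a newer value kind was used.
type AttributeValue struct {
	// QuantityValue is a quantity.
	QuantityValue *resource.Quantity `json:"quantity,omitempty"`
	// BoolValue is a true/false value.
	BoolValue *bool `json:"bool,omitempty"`
	// StringValue is a string. Must not be longer than 64 characters.
	StringValue *string `json:"string,omitempty"`
}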

//
// +optional
Selector *metav1.LabelSelector `json:"selector,omitempty"`
// +listType=atomic
Requirements []Requirement `json:"requirements,omitempty" protobuf:"bytes,4,opt,name=requirements"`
Comment:

I am skeptical that there will be another form of criteria, and doubly skeptical that we would want to allow BOTH at the same time. So if this became constraints string or even []string I'd be super happy.

// separately, then they could be consistent within claims, but
// inconsistent across claims. Therefore, we need this additional
// constraint.
// Constraint must have one and only one field set.
Comment:

For posterity: This is a proper use of list-of-one-of. We have potentially more than 1 criterion at the same time, and this will signal to back-rev components that there is something here they don't understand.
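
As a concrete illustration of the list-of-one-of pattern described in this comment, a sketch is shown below; the field name is assumed, based on the MatchAttributes feature mentioned earlier in this PR.

// Constraint must have one and only one field set. Because constraints
// are a list, several criteria can apply at once, and a back-rev client
// that sees an entry with no field it recognizes knows it is missing
// something rather than silently ignoring it.
type Constraint struct {
	// MatchAttribute requires that all selected devices have this attribute
	// and that its value is identical across them.
	MatchAttribute *string `json:"matchAttribute,omitempty"`

	// Future constraint kinds would be added here as new one-of members.
}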

// This is the node which provides the allocated devices.
//
// If empty, then the devices are not local to one particular node.
NodeName string `json:"nodeName,omitempty" protobuf:"bytes,2,opt,name=nodeName"`
Comment:

Don't we want this per-device? Or more flexibly, don't we want the same "..or a node selector" as with a device pool?

For example, if I claim one device which is rack-switched and one which is row-switched (multiple racks), I might have one device that says rack=rk93 and another device that says row=rw76, and the resulting node selector, for future users of this claim, is the AND of those?

Contributor:

I had the same question here: #26 (comment)

Let's move the discussion there and off of this merged PR...

pohly added a commit to pohly/enhancements that referenced this pull request Jun 7, 2024
This is the result of the prototyping in
https://github.com/kubernetes-sigs/wg-device-management/tree/main/dra-evolution.

The key changes compared to 1.30 are:
- Removal of CRD support. Vendor configuration parameters
  are still supported as raw extensions of the in-tree types.
- Removal of the notion of a "device model" and a selection mechanism
  that only works for one such model. This simplifies the API.
- Instead of one claim referencing a ResourceClass, individual
  requests inside a claim now reference a DeviceClass.
- Support for "match attributes".
- Support for "management access".
- Support for selecting devices from a request in a claim for a
  container instead of exposing all devices of a claim.

Editorial changes:
- Reduced usage of the word "resource" in favor of "devices" and
  "DRA drivers".
- Removal of some out-dated text.

This revision of the KEP roughly corresponds to
https://github.com/kubernetes-sigs/wg-device-management/tree/258d109f54d3baaa48719519dec37a450a1dd5a1/dra-evolution
(= kubernetes-sigs/wg-device-management#14 with some
minor edits).

What is missing:
- renaming of ResourceSlice ->
  ResourcePool (https://github.com/kubernetes-sigs/wg-device-management/pull/28/files#r1630964004)

What was added from kubernetes-sigs/wg-device-management#26:
- ResourceClaim.Status.Allocation.NodeName -> NodeSelector
- flattening of the allocation result

Undecided:
- kubernetes-sigs/wg-device-management#26
- kubernetes-sigs/wg-device-management#16
- kubernetes-sigs/wg-device-management#18
pohly added a commit to pohly/enhancements that referenced this pull request Jun 11, 2024
(Same commit message as the Jun 7 commit above.)
Labels: approved · cncf-cla: yes · lgtm · size/XXL
7 participants