DRA API evolution #14

Merged
merged 71 commits into from
Jun 5, 2024

Changes from 3 commits

Commits (71):
a67f632
add "dra-evolution" proposal
pohly May 13, 2024
1504622
dra_evolution: add quota
pohly May 13, 2024
7c2d569
dra-evolution: claim status notes
pohly May 13, 2024
e3b570c
dra-evolution: move pod to api
pohly May 14, 2024
0a77223
dra-evolution: add ResourceClaimTemplate
pohly May 14, 2024
bec6652
dra-evolution: add automatic testing of YAML files
pohly May 14, 2024
ab87abd
dra-evolution/testdata: split up YAML
pohly May 14, 2024
c2a062c
Migrate dra-evolution/testdata/classes.yaml to relevant types
klueska May 14, 2024
84cc5f1
dra-evolution api: fix container resource claim list
pohly May 14, 2024
56c74fd
dra-evolution: validate CEL expressions
pohly May 14, 2024
dea8ded
dra-evolution: fix some CEL expressions
pohly May 14, 2024
83b43b9
dra-evolution: also validate ResourceClaim[Template]
pohly May 14, 2024
73af0cf
dra-prototype README.md: compare CEL syntax
pohly May 14, 2024
ee69061
dra-evolution: consolidate filter and request types
pohly May 15, 2024
8024e0c
dra-evolution: add device.driverName
pohly May 15, 2024
6f8518e
dra-evolution: update classes.yaml
pohly May 15, 2024
d435a4a
dra-evolution README.md: explain how to do a YAML diff
pohly May 15, 2024
8cbe299
dra-evolution: update pod_types.go with proper json/protobuf tags
klueska May 15, 2024
c34e043
dra-evolution: migrate pod-one-container-one-gpu.yaml to relevant types
klueska May 15, 2024
7a622ff
dra-evolution: device.attributes and device.<type>Attributes
pohly May 15, 2024
59de778
dra-evolution: clarify expection for 'one device' class
pohly May 15, 2024
f4d03d5
dra-evolution: update claim_types.go with proper json tags
klueska May 16, 2024
22c7ae6
dra-evolution: migrate pod-one-container-two-gpus-*.yaml to relevant …
klueska May 16, 2024
65d5c49
dra-evolution: add MatchAttributes into ResourceRequestDetail
klueska May 16, 2024
4e9e3f3
dra-evolution: update examples with new "localized" MatchAttributes
klueska May 16, 2024
f4b4052
dra-evolution: migrate two-pods-one-gpu-*.yaml to relevant types
klueska May 16, 2024
bb2a6d8
dra-evolution: harmonize fields
pohly May 16, 2024
846e397
dra-evolution: remove ResourceClaimDevice and expand ResourceClaimEntry
klueska May 16, 2024
7b3aef2
dra-evolution: migrate pod-two-containers-*.yaml to relevant types
klueska May 16, 2024
bfa9359
dra-evolution: migrate pod-one-container-one-gpu-one-vf.yaml to relev…
klueska May 16, 2024
88f447b
dra-evolution: consistently use pcie-root.dra.k8s.io in the examples
klueska May 16, 2024
d3fb9c4
dra-evolution: skip YAML test cases that haven't been converted
pohly May 16, 2024
4cb8709
dra-evolution: unknown keys are not runtime errors
pohly May 16, 2024
45d14ba
dra-evolution: require fully-qualified attribute names
pohly May 16, 2024
ac9871d
dra-evolution: use fully-qualified attribute name
pohly May 16, 2024
aa0d952
dra-evolution: introduce more general request requirements
pohly May 17, 2024
cd5ba42
dra-evolution: add support for expressing shared resources between a …
klueska May 17, 2024
bd8117a
dra-evolution: migrate pools-two-nodes-one-dgxa100.yaml to relevant t…
klueska May 17, 2024
3e5c69f
dra-evolution: refine the notion of a ResourceRequirement based on sh…
klueska May 17, 2024
3b98607
dra-evolution: migrate pod-one-container-shared-split-allocation-gpus…
klueska May 17, 2024
324e2e4
dra-evolution: add missing IntRange
pohly May 18, 2024
e71290b
dra-evolution: introduce "claim requirements"
pohly May 18, 2024
3ec5c7e
dra-evolution: hide empty claim and request options
pohly May 19, 2024
7703dfa
dra-evolution: revise class inheritance
pohly May 19, 2024
b4c48f6
dra-evolution: typo fix
pohly May 19, 2024
74df07f
dra-evolution: harmonize list of vendor configs with other fields
pohly May 19, 2024
adb0b5d
dra-evolution: add proper type for intrange
klueska May 20, 2024
c920b2b
dra-evolution: revise naming of allocation result fields and structs
pohly May 21, 2024
210b376
dra-evolution: remove "forClass" claim source
pohly May 21, 2024
6c887ba
dra-evolution: rename class references
pohly May 21, 2024
aecf75c
fixup! dra-evolution: remove forClass claim source
pohly May 21, 2024
819ac75
dra-evolution: use "constraints" and "requirements"
pohly May 21, 2024
883167e
dra-evolution: update stale content in README.md
pohly May 22, 2024
ed651b1
dra-evolution: fix mock api server
pohly May 22, 2024
1dbb6d2
dra-evolution: remove multi-inheritance of classes
pohly May 22, 2024
6eab574
dra-evolution: remove "source" nesting via inlining
pohly May 22, 2024
9120701
dra-evolution: add support for network-attached devices
pohly May 22, 2024
b18ad3b
Revert "dra-evolution: add proper type for intrange"
pohly May 24, 2024
3b324ee
Revert "dra-evolution: migrate pod-one-container-shared-split-allocat…
pohly May 24, 2024
75d1d2e
Revert "dra-evolution: refine the notion of a ResourceRequirement bas…
pohly May 24, 2024
ed56e13
Revert "dra-evolution: migrate pools-two-nodes-one-dgxa100.yaml to re…
pohly May 24, 2024
9e74635
Revert "dra-evolution: add support for expressing shared resources be…
pohly May 24, 2024
ebfa95f
dra-evolution: ResourceClass -> DeviceClass
pohly May 24, 2024
7aeb3c4
dra-evolution: revise ResourcePool
pohly May 28, 2024
82be70e
DRA: simplified proposal
pohly May 31, 2024
308b670
dra-prototype: review feedback
pohly Jun 3, 2024
13eace9
dra-prototype: use driver names which follow the recommended naming p…
pohly Jun 3, 2024
3ed68fa
dra-evolution: simplify referencing the node in allocation result
pohly Jun 4, 2024
dfd9ec8
dra-evolution: review feedback
pohly Jun 4, 2024
49e124c
dra-evolution: update README.md
pohly Jun 4, 2024
72c1f7c
dra-evolution: typo fix
pohly Jun 5, 2024
32 changes: 2 additions & 30 deletions dra-evolution/README.md
@@ -3,30 +3,7 @@
The [k8srm-prototype](../k8srm-prototype/README.md) is an attempt to derive a
new API for device management from scratch. The API in this directory is taking
the opposite approach: it incorporates ideas from the prototype into the 1.30
-DRA API. For some problems it picks a different approach. The following
-comparison is provided for those who already know one or the other
-approach. Everyone else should probably read the proposals first and then come
-back here. The last column explains why dra-evolution takes this approach.
-
-| Use case, problem | DRA 1.30 | k8srm-prototype | dra-evolution | rationale |
-| --- | --- | --- | --- | --- |
-Classes | required, provide admin-level config and the driver name | DeviceClass: required, selects vendor driver and one device | DeviceClass: optional, can be vendor-independent, adds configuration, selection criteria for *devices* (not claims!) | Classes can be useful, but they are not portable across clusters unless we pre-define class names, which shouldn't be a goal. Therefore they are optional for those cases where they make sense, but not required. Access control is based on claim fields (management mode) and device attributes, not classes.
-Custom APIs with CRDs | Vendors convert CRDs into class or claim parameters. | CRDs only provide configuration, content gets copied during allocation by scheduler. | As in 1.30 minus class CRDs. Claim parameters usually get specified directly in the claim. The ResourceClaimSpecification (= former ResourceClaimParameters) type is only used when a CRD reference is involved. | It is unclear whether any approach that depends on core Kubernetes reading vendor CRDs will pass reviews. Once this is clarified, this aspect can be revisited.
-Management access | only in "classic DRA" | Field for device, not in class, checked via v1.ResourceQuota during admission. | Field for device, can be set in class, checked via resource.k8s.io ResourcePolicy during allocation. | Checking at admission time is too limited. Eventually we will need a quota system that is based on device attributes.
-Pod-level claims | Flat list with each entry mapping to a claim or claim template. | undecided ? | Flat list with each entry mapping to a claim or claim template. | Adding syntactic sugar like "create a claim for this class" is out of scope for 1.31.
-Container-level claim references | name from list | two-level (claim + device in claim) ? | one level (all devices in a claim), two-level (specific device in claim) | The two-level case is needed when using a single claim to do matching between different devices and then wanting a container to use only one of the devices.
-Matching between devices | only in "classic DRA" | MatchAttributes in claim | MatchAttributes in claim | This solves a sub-set of the matching problem. A more general solution would be a CEL expression, but that needs more thought and would be harder to use, so providing a "simple" solution seems worthwhile. Matching across claims is not supported by either proposal. Supporting it would require putting fields whose semantics might still need to evolve into a v1 API; perhaps after GA?
-Alternative device sets ("give me X, otherwise Y and Z") | only in "classic DRA" | oneOf, allOf | not supported | "oneOf" would be useful, but can be added later.
-Scoring | only in "classic DRA" | none | none | Like matching, this needs to be defined for a claim, with all devices of a potential solution as input. This is a tough problem that already occurs for a single device (pick "smallest" GPU or "biggest"?) and can easily lead to combinatorial explosion.
-Claim status | Only allocation | Allocation, per-plugin status | Only allocation, status TBD | Kubelet writing data provided by plugins leads to the [version skew problem](https://github.com/kubernetes/kubernetes/issues/123699). This becomes even worse when that data is likely to change when new status fields get added. This needs more thought before we put anything into the API that depends on sorting out this implementation challenge.
-Claim template | Separate type | Re-uses claim + object meta in pod spec | Separate type | Defining claims that will never be used as claims "feels" weird. They also show up in `kubectl get resourceclaims -A` as "unallocated", which could be confusing.
-"Resource" vs. "device" | resource | device | resource at top level, device inside | Only some of the semantics defined in the prototype are specific to devices. Other parts (like creating claims from templates, deallocation) are generic. If we ever need to add support for some other kind of resource, we would have to duplicate the entire outer API and copy-and-paste the generic code (Go generics don't support accessing "common" fields unless we define interfaces for everything, also typed client-go, etc.).
-Resource model | one, potentially others | only one | one, potentially others, but with simpler YAML structure | The API should be as simple and natural as possible, but we need to keep the ability to add future extensions.
-Driver handling allocation | in "classic DRA" | none | in "classic DRA" | We are not going to handle all the advanced scheduling use cases that people have solved with custom DRA control plane controllers, not now and perhaps never. It's too early to drop "classic DRA".
-Vendor configuration for multiple devices | vendor parameters in claim and class | none ? | vendor parameters in claim | Storing configuration that isn't specific to one device under one device feels like a workaround.
-Partitioning | only in "classic DRA" | SharedResources | not added yet, still uses "named resources" | For the sake of simplicity, the current proposal doesn't attempt to modify how instances are described.
-CEL syntax | `attributes.<type>[<attribute name>]` = type known at compile time | `device.<attribute name>` = type determined at runtime | `device.attribute[<attribute name>]`, `device.<type>[<attribute name>]` | Access to attributes is supported both with and without runtime type checking because both can be useful. In both cases, map lookups are used because the mapping of attribute names to CEL field names isn't always obvious ("device-type" -> "deviceType" ?). With the typed maps we can have reasonable default values for unknown keys.
-
+DRA API. For some problems it picks a different approach.
To compare YAML files, something like this can be used:
```
diff -C2 ../k8srm-prototype/testdata/classes.yaml <(sed -e 's;resource.k8s.io/v1alpha2;devmgmtproto.k8s.io/v1alpha1;' -e 's/ResourceClass/DeviceClass/' testdata/classes.yaml)
@@ -61,12 +38,7 @@ projects.
## Open Questions

The next few sections of this document describe a proposed model. Note that this
-is really a brainstorming exercise and under active development. See the [open
-questions](open-questions.md) document for some of the still under discussion
-items.
-
-We are also looking at how we might extend the existing 1.30 DRA model with some
-of these ideas, rather than changing it out for these specific types.
+is really a brainstorming exercise and under active development.

## Pod Spec

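The CEL syntax row of the comparison table above, deleted from README.md in this PR, argues that typed maps such as `device.<type>[<attribute name>]` give reasonable default values for unknown keys. The commit "unknown keys are not runtime errors" makes the same point. The following is a minimal Go sketch of that lookup behavior. The `typedAttributes` type, its helpers, and the `gpu.example.com/...` attribute names are hypothetical. This is not the CEL environment the prototype actually builds.

```go
package main

import "fmt"

// typedAttributes is a hypothetical Go model of the per-type maps that a CEL
// expression like `device.<type>[<attribute name>]` indexes into.
type typedAttributes struct {
	Bool   map[string]bool
	String map[string]string
}

// lookupBool mirrors `device.bool["<name>"]`: an unknown key yields the zero
// value (false) instead of a runtime error.
func (a typedAttributes) lookupBool(name string) bool {
	return a.Bool[name] // missing keys (or a nil map) return false
}

// lookupString mirrors `device.string["<name>"]`, defaulting to "".
func (a typedAttributes) lookupString(name string) string {
	return a.String[name]
}

func main() {
	dev := typedAttributes{
		Bool:   map[string]bool{"gpu.example.com/mig-capable": true},
		String: map[string]string{"gpu.example.com/model": "A100"},
	}
	fmt.Println(dev.lookupBool("gpu.example.com/mig-capable")) // true
	fmt.Println(dev.lookupBool("gpu.example.com/unknown"))     // false, not an error
	fmt.Println(dev.lookupString("gpu.example.com/model"))     // A100
}
```

The untyped `device.attribute[...]` form would instead need a runtime type check on each access. The typed form can fall back to a per-type zero value.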
95 changes: 0 additions & 95 deletions dra-evolution/open-questions.md

This file was deleted.

43 changes: 31 additions & 12 deletions dra-evolution/pkg/api/capacity_types.go
@@ -14,6 +14,11 @@ import (
// for a device is the tuple `<driver name>/<node name>/<device name>`. Each
// of these names is a DNS label or domain, so it is okay to concatenate them
// like this in a string with a slash as separator.
johnbelamaric marked this conversation as resolved.
//
// Consumers should be prepared to handle situations where the same device is
// listed in different pools, for example because the producer already added it
// to a new pool before removing it from an old one. Should this occur, then

Contributor:
s/occurr/occur/

non-blocking nit

Contributor Author:
Fixed.

// there is still only one such device instance.
johnbelamaric marked this conversation as resolved.
type ResourcePool struct {
pohly marked this conversation as resolved.
metav1.TypeMeta `json:",inline"`
// Standard object metadata
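The comment in this hunk defines a device's unique ID as the tuple `<driver name>/<node name>/<device name>`. It also asks consumers to tolerate the same device appearing in two pools, for example while a producer moves it. The following is a minimal consumer-side sketch of that rule. The `pool` type and the `deviceID`/`uniqueDevices` helpers are hypothetical simplifications, not the API types in this PR.

```go
package main

import "fmt"

// pool is a simplified, hypothetical stand-in for ResourcePool: only the
// fields needed to build a device ID are kept.
type pool struct {
	DriverName string
	NodeName   string
	Devices    []string // device names
}

// deviceID builds the `<driver name>/<node name>/<device name>` tuple.
// Each part is a DNS label or domain, so no part can contain "/".
func deviceID(driver, node, device string) string {
	return driver + "/" + node + "/" + device
}

// uniqueDevices returns each device ID once, even if a producer has
// temporarily listed the same device in more than one pool.
func uniqueDevices(pools []pool) []string {
	seen := make(map[string]bool)
	var ids []string
	for _, p := range pools {
		for _, name := range p.Devices {
			id := deviceID(p.DriverName, p.NodeName, name)
			if seen[id] {
				continue // same device instance listed again
			}
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	pools := []pool{
		{DriverName: "gpu.example.com", NodeName: "node-1", Devices: []string{"gpu-0", "gpu-1"}},
		// gpu-1 was added to a new pool before being removed from the old one.
		{DriverName: "gpu.example.com", NodeName: "node-1", Devices: []string{"gpu-1"}},
	}
	fmt.Println(uniqueDevices(pools))
	// [gpu.example.com/node-1/gpu-0 gpu.example.com/node-1/gpu-1]
}
```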
@@ -53,13 +58,17 @@ type ResourcePoolSpec struct {
DriverName string `json:"driverName" protobuf:"bytes,3,name=driverName"`

// Devices lists all available devices in this pool.
johnbelamaric marked this conversation as resolved.
//
// Must not have more than 128 entries.
Devices []Device `json:"devices,omitempty"`

// FUTURE EXTENSION: some other kind of list, should we ever need it.
// Old clients seeing an empty Devices field can safely ignore the (to
// them) empty pool.
}

const ResourcePoolMaxDevices = 128

// Device represents one individual hardware instance that can be selected based
// on its attributes.
type Device struct {
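This hunk adds `ResourcePoolMaxDevices = 128`. It also notes that old clients can ignore a pool whose `Devices` list is empty, because such a pool would use some future kind of list. The following is a minimal sketch of both rules. The `poolSpec` type and the `validatePoolSpec`/`usablePools` helpers are hypothetical. The real validation code is not part of this diff.

```go
package main

import "fmt"

const ResourcePoolMaxDevices = 128 // value taken from capacity_types.go

// poolSpec is a hypothetical, trimmed-down ResourcePoolSpec.
type poolSpec struct {
	Devices []string
}

// validatePoolSpec enforces the "must not have more than 128 entries" rule.
func validatePoolSpec(spec poolSpec) error {
	if n := len(spec.Devices); n > ResourcePoolMaxDevices {
		return fmt.Errorf("devices: %d entries exceed the limit of %d", n, ResourcePoolMaxDevices)
	}
	return nil
}

// usablePools skips pools without devices, which is how an old client would
// treat a pool that only populates some future, unknown kind of list.
func usablePools(specs []poolSpec) []poolSpec {
	var out []poolSpec
	for _, s := range specs {
		if len(s.Devices) == 0 {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	tooBig := poolSpec{Devices: make([]string, ResourcePoolMaxDevices+1)}
	fmt.Println(validatePoolSpec(tooBig)) // devices: 129 entries exceed the limit of 128
	fmt.Println(len(usablePools([]poolSpec{{}, {Devices: []string{"gpu-0"}}}))) // 1
}
```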
@@ -70,14 +79,23 @@ type Device struct {
// Attributes defines the attributes of this device.
// The name of each attribute must be unique.
johnbelamaric marked this conversation as resolved.
//
// Must not have more than 32 entries.
//
// +listType=atomic

Reviewer:
It's not just about "I don't need patch".

I know API machinery does not enforce unique keys in all cases today but it DOES:
a) support unique-key enforcement in some paths (e.g. SSA)
b) plan to support declarative validation

Declaring this (and all such lists) as +listType=map and +listMapKey=name should be sufficient to auto-generate validation (in time).

Contributor Author:
Adding +listType=map to get validation seems wrong to me. I still think the main criteria should be "who owns this" and "how is it being updated". Adding +listType=map when it is not needed can significantly increase the managed fields annotations because then ownership of each entry gets tracked.

If we ever add automatic validation, then I'd like to see it enabled for a new +listType=atomicMap with +listMapKey=name.

Reviewer:
From discussion this morning - ACK the potential problem with managedfields. Experiment and we can regroup.

Contributor Author:
It's bad: 77% of the protobuf encoding is for managed fields. See https://kubernetes.slack.com/archives/C0EG7JC6T/p1717485534890529 in #sig-api-machinery.

// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`

// TODO for 1.31: define how to support partitionable devices
}

const ResourcePoolMaxAttributesPerDevice = 32

// ResourcePoolMaxDevices and ResourcePoolMaxAttributesPerDevice were chosen
// so that with the maximum attribute length of 96 characters the total size of
// the ResourcePool object is around 420KB.

// DeviceAttribute is a combination of an attribute name and its value.
// Exactly one value must be set.
type DeviceAttribute struct {
// Name is a unique identifier for this attribute, which will be
// referenced when selecting devices.
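`Attributes` stays `+listType=atomic` (see the thread above), so the list type does not get generated uniqueness validation, and the "name must be unique" rule needs an explicit check. The following is a minimal sketch with a hypothetical `validateAttributeNames` helper. It enforces that rule and the 32-entry limit.

The sketch also reproduces the arithmetic behind the ~420KB estimate. It assumes the "maximum attribute length of 96 characters" is the longest possible name: a 63-character domain, "/", and a 32-character identifier. Under that assumption, 128 devices × 32 attributes × 96 bytes is 393,216 bytes for names alone, before values and encoding overhead.

```go
package main

import "fmt"

const (
	ResourcePoolMaxDevices             = 128         // from capacity_types.go
	ResourcePoolMaxAttributesPerDevice = 32          // from capacity_types.go
	maxAttributeNameLength             = 63 + 1 + 32 // assumed: domain + "/" + identifier
)

// validateAttributeNames is a hypothetical helper: it enforces the per-device
// attribute limit and the "name must be unique" rule, which +listType=atomic
// does not enforce on its own.
func validateAttributeNames(names []string) error {
	if len(names) > ResourcePoolMaxAttributesPerDevice {
		return fmt.Errorf("attributes: %d entries exceed the limit of %d",
			len(names), ResourcePoolMaxAttributesPerDevice)
	}
	seen := make(map[string]bool, len(names))
	for i, name := range names {
		if seen[name] {
			return fmt.Errorf("attributes[%d]: duplicate name %q", i, name)
		}
		seen[name] = true
	}
	return nil
}

// worstCaseNameBytes estimates the bytes taken by attribute names alone in a
// maximally filled pool. Values and encoding overhead come on top.
func worstCaseNameBytes() int {
	return ResourcePoolMaxDevices * ResourcePoolMaxAttributesPerDevice * maxAttributeNameLength
}

func main() {
	fmt.Println(validateAttributeNames([]string{"model", "memory", "model"})) // attributes[2]: duplicate name "model"
	fmt.Println(worstCaseNameBytes())                                         // 393216
}
```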
@@ -95,23 +113,24 @@ type DeviceAttribute struct {
// include the domain prefix are assumed to be part of the driver's
// domain. Attributes defined by 3rd parties must include the domain
// prefix.
//
johnbelamaric marked this conversation as resolved.
// The maximum length for the DNS subdomain is 63 characters (same as
// for driver names) and the maximum length of the C-style identifier
// is 32.
Name string `json:"name" protobuf:"bytes,1,name=name"`

DeviceAttributeValue `json:",inline" protobuf:"bytes,2,opt,name=attributeValue"`
}
// The Go field names below have a Value suffix to avoid a conflict between the
// field "String" and the corresponding method. That method is required.
// The Kubernetes API is defined without that suffix to keep it more natural.

// The Go field names below have a Value suffix to avoid a conflict between the
// field "String" and the corresponding method. That method is required.
// The Kubernetes API is defined without that suffix to keep it more natural.

// DeviceAttributeValue must have one and only one field set.
type DeviceAttributeValue struct {
// QuantityValue is a quantity.

Reviewer:
Candidate for list-of-one-of pattern because we might add more types

Suppose we add FloatValue.

A forward-rev driver which tries to set a float will fail at the (back-rev) API because none of the known fields are set. Good.

A back-rev scheduler will know something is wrong because none of the known fields are set. Good.

So this seems OK?

Contributor Author:
I agree. Embedded one-of vs. "struct with some shared fields (name here in this case) and some one-of fields (the value fields here)" have the same semantics, as long as clients are aware.

QuantityValue *resource.Quantity `json:"quantity,omitempty" protobuf:"bytes,6,opt,name=quantity"`
QuantityValue *resource.Quantity `json:"quantity,omitempty" protobuf:"bytes,2,opt,name=quantity"`
// BoolValue is a true/false value.
BoolValue *bool `json:"bool,omitempty" protobuf:"bytes,2,opt,name=bool"`
BoolValue *bool `json:"bool,omitempty" protobuf:"bytes,3,opt,name=bool"`
// StringValue is a string.

Reviewer:
need to set a max length on this too :)

Contributor Author:
64?

Reviewer:
SGTM

StringValue *string `json:"string,omitempty" protobuf:"bytes,5,opt,name=string"`
StringValue *string `json:"string,omitempty" protobuf:"bytes,4,opt,name=string"`
// VersionValue is a semantic version according to semver.org spec 2.0.0.
VersionValue *string `json:"version,omitempty" protobuf:"bytes,10,opt,name=version"`
VersionValue *string `json:"version,omitempty" protobuf:"bytes,5,opt,name=version"`
}

const DeviceAttributeMaxIDLength = 32
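The `DeviceAttributeValue` thread argues that an embedded one-of is forward compatible. If a newer client sets a value type that an older server does not know (the thread's `FloatValue` example), none of the known fields are set and validation fails.

The following sketch shows that behavior, using Go's default `encoding/json` handling of unknown fields (they are ignored). It also checks the attribute name rules from the `Name` comment: an optional DNS subdomain prefix of at most 63 characters, then a C-style identifier of at most `DeviceAttributeMaxIDLength` (32) characters.

Only the limits and the "exactly one value" rule come from the diff. The following are all assumptions:
- the `Attribute`/`AttributeValue` types
- the `validateName`/`validateValue` helpers
- the exact regular expressions
- the `float` JSON key

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

const (
	DeviceAttributeMaxIDLength = 32 // from capacity_types.go
	maxDomainLength            = 63 // per the Name comment in this hunk
)

// Assumed character sets: a DNS-1123 subdomain for the optional domain
// prefix and a C identifier for the rest of the name.
var (
	domainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	idRE     = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)
)

// AttributeValue mirrors DeviceAttributeValue. QuantityValue is a *string
// stand-in for *resource.Quantity to keep the sketch dependency-free.
type AttributeValue struct {
	QuantityValue *string `json:"quantity,omitempty"`
	BoolValue     *bool   `json:"bool,omitempty"`
	StringValue   *string `json:"string,omitempty"`
	VersionValue  *string `json:"version,omitempty"`
}

// Attribute mirrors DeviceAttribute; embedding promotes the value fields so
// they sit next to "name" in JSON.
type Attribute struct {
	Name string `json:"name"`
	AttributeValue
}

// validateName accepts "<identifier>" or "<domain>/<identifier>".
func validateName(name string) error {
	id := name
	if i := strings.LastIndex(name, "/"); i >= 0 {
		domain := name[:i]
		id = name[i+1:]
		if len(domain) > maxDomainLength || !domainRE.MatchString(domain) {
			return fmt.Errorf("name %q: invalid domain prefix %q", name, domain)
		}
	}
	if len(id) > DeviceAttributeMaxIDLength || !idRE.MatchString(id) {
		return fmt.Errorf("name %q: invalid identifier %q", name, id)
	}
	return nil
}

// validateValue enforces "one and only one field set".
func validateValue(v AttributeValue) error {
	set := 0
	for _, isSet := range []bool{v.QuantityValue != nil, v.BoolValue != nil, v.StringValue != nil, v.VersionValue != nil} {
		if isSet {
			set++
		}
	}
	if set != 1 {
		return fmt.Errorf("exactly one value must be set, got %d", set)
	}
	return nil
}

func main() {
	// A newer client sends an unknown "float" value: encoding/json drops the
	// unknown field, so no known value is set and validation fails.
	var a Attribute
	if err := json.Unmarshal([]byte(`{"name":"gpu.example.com/clock","float":1.5}`), &a); err != nil {
		panic(err)
	}
	fmt.Println(validateName(a.Name))            // <nil>
	fmt.Println(validateValue(a.AttributeValue)) // exactly one value must be set, got 0
}
```

The same count-the-set-fields check covers a back-rev scheduler: it sees zero known fields and can tell that something is wrong instead of misreading the value.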