From 064eb667b2ae1b3b91d69b397664fe33855498b4 Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Mon, 21 Nov 2022 16:26:20 +0100 Subject: [PATCH] dynamic resource allocation: match API with actual implementation in 1.26 This copies back the API description from pkg/apis/core/types.go which got changed as part of the code review. The focus of the API comments shifted from describing how the API is used to what it contains. The additional comments from the KEP get preserved as commentary on the API. The other change is the introduction if resource.k8s.io and making templates a stand-alone object. Co-authored-by: Kevin Klues --- .../README.md | 752 ++++++++++-------- 1 file changed, 418 insertions(+), 334 deletions(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index 47febcc80c3..d3c81db95bd 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -98,6 +98,8 @@ SIG Architecture for cross-cutting KEPs). - [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler) - [Resource allocation and usage flow](#resource-allocation-and-usage-flow) - [API](#api) + - [resource.k8s.io](#resourcek8sio) + - [core](#core) - [kube-controller-manager](#kube-controller-manager) - [kube-scheduler](#kube-scheduler) - [Pre-filter](#pre-filter) @@ -441,6 +443,21 @@ metadata: name: device-consumer-gpu-parameters memory: "2Gi" --- +apiVersion: resource.k8s.io/v1alpha1 +kind: ResourceClaimTemplate +metadata: + name: device-consumer-gpu-template +spec: + metadata: + # Additional annotations or labels for the + # ResourceClaim could be specified here. + spec: + resourceClassName: "acme-gpu" + parametersRef: + apiGroup: gpu.example.com + kind: GPURequirements + name: device-consumer-gpu-parameters +--- apiVersion: v1 kind: Pod metadata: @@ -449,11 +466,7 @@ spec: resourceClaims: - name: "gpu" # this name gets referenced below under "claims" template: - resourceClassName: "acme-gpu" - parametersRef: - apiGroup: gpu.example.com - kind: GPURequirements - name: device-consumer-gpu-parameters + resourceClaimTemplateName: device-consumer-gpu-template containers: - name: workload image: my-app @@ -686,7 +699,7 @@ Several components must be implemented or modified in Kubernetes: - The new API must be added to kube-apiserver. The ResourceQuota admission plugin needs to check the new quota limits when ResourceClaims get created. - A new controller in kube-controller-manager which creates - ResourceClaims from Pod ResourceClaimTemplates, similar to + ResourceClaims from ResourceClaimTemplates, similar to https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral. It also removes the reservation entry for a user in `claim.status.reservedFor`, the field that tracks who is allowed to use a claim, when that user no longer exists. @@ -903,7 +916,7 @@ ResourceClaim. But often, each Pod is meant to have exclusive access to its own ResourceClaim instance instead. To support such ephemeral resources without having to modify all controllers that create Pods, an entry in the new PodSpec.ResourceClaims -list can also be a ResourceClaimTemplate. When a Pod gets created, such a +list can also be a reference to a ResourceClaimTemplate. 
When a Pod gets created, such a template will be used to create a normal ResourceClaim with the Pod as owner with an [OwnerReference](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#OwnerReference)), @@ -939,8 +952,8 @@ arbitrarily. Some combinations are more useful than others: | claim | slow allocation as soon | while they are not needed yet | | | as possible | | +-----------+------------------------------------+---------------------------------+ -| inline | same benefit as above, | resource allocated when needed, | -| claim | but ignores other pod constraints | allocation coordinated by | +| claim | same benefit as above, | resource allocated when needed, | +| template | but ignores other pod constraints | allocation coordinated by | | | during allocation | scheduler | +-----------+------------------------------------+---------------------------------+ ``` @@ -1037,11 +1050,11 @@ deallocating one or more claims. ### Resource allocation and usage flow The following steps shows how resource allocation works for a resource that -gets defined inline in a Pod. Several of these steps may fail without changing +gets defined in a ResourceClaimTemplate and referenced by a Pod. Several of these steps may fail without changing the system state. They then must be retried until they succeed or something else changes in the system, like for example deleting objects. -* **user** creates Pod with inline ResourceClaimTemplate +* **user** creates Pod with reference to ResourceClaimTemplate * **resource claim controller** checks ResourceClaimTemplate and ResourceClass, then creates ResourceClaim with Pod as owner * if *immediate allocation*: @@ -1107,114 +1120,133 @@ claims. ### API -The PodSpec gets extended. Types and structs referenced from PodSpec -(ResourceClaimTemplate, ResourceClaimSpec) must be placed in `core.k8s.io/v1` -to keep that package self-contained. The new fields in the PodSpec are gated by -the DynamicResourceAllocation feature gate and can only be set when it is -enabled. Initially, they are declared as alpha and then can still be changed -from one release to the next. - -ResourceClaim and ResourceClass are new built-in types in `core.k8s.io/v1`. -They only get registered when the feature gate is enabled. - -Putting everything into `core.k8s.io/v1` was chosen instead of using CRDs -because core Kubernetes components must interact with the new objects and -installation of CRDs as part of cluster creation is an unsolved problem. They -could be defined in some other API group (`node.k8s.io`, some new group) but -having everything in the same group is less confusing for users. +The PodSpec gets extended. To minimize the changes in core/v1, all new types +get defined in a new resource group. This makes it possible to revise those +more experimental parts of the API in the future. The new fields in the +PodSpec are gated by the DynamicResourceAllocation feature gate and can only be +set when it is enabled. Initially, they are declared as alpha. Even though they +are alpha, changes to their schema are discouraged and would have to be done by +using new field names. + +ResourceClaim, ResourceClass and ResourceClaimTemplate are new built-in types +in `resource.k8s.io/v1alpha1`. This alpha group must be explicitly enabled in +the apiserver's runtime configuration. Using builtin types was chosen instead +of using CRDs because core Kubernetes components must interact with the new +objects and installation of CRDs as part of cluster creation is an unsolved +problem. 
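+
+As a concrete illustration (a minimal sketch, not a prescription — the exact
+mechanism depends on how the cluster is deployed), enabling the feature gate
+and the alpha API group on the kube-apiserver could look like this:
+
+```
+kube-apiserver \
+  --feature-gates=DynamicResourceAllocation=true \
+  --runtime-config=resource.k8s.io/v1alpha1=true \
+  ... # other flags omitted
+```
+
+The DynamicResourceAllocation feature gate also has to be enabled in the other
+components that contain new code for this KEP (kube-controller-manager,
+kube-scheduler, kubelet).
+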
Secrets are not part of this API: if a resource driver needs secrets, for example to access its own backplane, then it can define custom parameters for those secrets and retrieve them directly from the apiserver. This works because drivers are expected to be written for Kubernetes. +#### resource.k8s.io + ``` // ResourceClass is used by administrators to influence how resources // are allocated. +// +// This is an alpha type and requires enabling the DynamicResourceAllocation +// feature gate. type ResourceClass struct { metav1.TypeMeta - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata + // Standard object metadata + // +optional metav1.ObjectMeta - // DriverName determines which resource driver is to be used for - // allocation of a ResourceClaim that uses this class. + // DriverName defines the name of the dynamic resource driver that is + // used for allocation of a ResourceClaim that uses this class. // - // Resource drivers have a unique name in normal domain order + // Resource drivers have a unique name in forward domain order + // (acme.example.com). DriverName string // ParametersRef references an arbitrary separate object that may hold - // parameters that will be used by the - // driver when allocating a resource that uses this class. The driver - // will be able to distinguish between parameters stored here and and - // those stored in ResourceClaimSpec. These parameters here can only be - // set by cluster administrators. - ParametersRef ResourceClassParametersReference + // parameters that will be used by the driver when allocating a + // resource that uses this class. A dynamic resource driver can + // distinguish between parameters stored here and those stored in + // ResourceClaimSpec. + // +optional + ParametersRef *ResourceClassParametersReference // Only nodes matching the selector will be considered by the scheduler // when trying to find a Node that fits a Pod when that Pod uses // a ResourceClaim that has not been allocated yet. // - // Setting this field is optional. If nil, all nodes are candidates. + // Setting this field is optional. If null, all nodes are candidates. + // +optional SuitableNodes *core.NodeSelector } +``` + +A copy of the driver name is necessary to enable usage +of the claim by the kubelet in case the ResourceClass gets +removed in the meantime. It also helps the resource driver +to determine whether it needs to handle a claim that got +marked for deletion. -// ResourceClaim is created by users to describe which resources they need. +``` +// ResourceClaim describes which resources are needed by a resource consumer. // Its status tracks whether the resource has been allocated and what the // resulting attributes are. +// +// This is an alpha type and requires enabling the DynamicResourceAllocation +// feature gate. type ResourceClaim struct { metav1.TypeMeta - - // The driver must set a finalizer here before it attempts to allocate - // the resource. It removes the finalizer again when a) the allocation - // attempt has definitely failed or b) when the allocated resource was - // deallocated. This helps to ensure that resources are not leaked - // during normal operation of the cluster. - // - // It cannot prevent force-deleting a ResourceClaim by clearing its - // finalizers (something that users should never do without being aware - // of the consequences) or help when the entire cluster gets deleted. 
- // - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata + // Standard object metadata + // +optional metav1.ObjectMeta // Spec describes the desired attributes of a resource that then needs // to be allocated. It can only be set once when creating the // ResourceClaim. - Spec core.ResourceClaimSpec + Spec ResourceClaimSpec // Status describes whether the resource is available and with which // attributes. + // +optional Status ResourceClaimStatus } +``` + +The driver must set a finalizer in a ResourceClaim before it attempts to allocate +the resource. It removes the finalizer when a) the allocation +attempt has definitely failed or b) the allocated resource was +deallocated. This helps to ensure that resources are not leaked +during normal operation of the cluster. + +It cannot prevent force-deleting a ResourceClaim by clearing its +finalizers (something that users should never do without being aware +of the consequences) or help when the entire cluster gets deleted. +``` // ResourceClaimSpec defines how a resource is to be allocated. type ResourceClaimSpec struct { // ResourceClassName references the driver and additional parameters // via the name of a ResourceClass that was created as part of the // driver deployment. - // - // The apiserver does not check that the referenced class exists, but a - // driver-specific admission webhook may require that and is allowed to - // reject claims where the class is missing. ResourceClassName string // ParametersRef references a separate object with arbitrary parameters - // that will be used by the - // driver when allocating a resource for the claim. + // that will be used by the driver when allocating a resource for the + // claim. // // The object must be in the same namespace as the ResourceClaim. - ParametersRef ResourceClaimParametersReference + // +optional + ParametersRef *ResourceClaimParametersReference // Allocation can start immediately or when a Pod wants to use the - // resource. Waiting for a Pod is the default. + // resource. "WaitForFirstConsumer" is the default. + // +optional AllocationMode AllocationMode } // AllocationMode describes whether a ResourceClaim gets allocated immediately // when it gets created (AllocationModeImmediate) or whether allocation is -// delayed until it is needed for a Pod (AllocationModeWaitForFirstConsumer). Other modes -// might get added in the future. +// delayed until it is needed for a Pod +// (AllocationModeWaitForFirstConsumer). Other modes might get added in the +// future. type AllocationMode string const ( @@ -1235,68 +1267,48 @@ const ( // the resulting attributes are. type ResourceClaimStatus struct { // DriverName is a copy of the driver name from the ResourceClass at - // the time when allocation started. It's necessary to enable usage - // of the claim by kubelet in case that the ResourceClass got - // removed in the meantime. It also helps the resource driver - // to determine whether it needs to handle a claim that got - // marked for deletion. + // the time when allocation started. + // +optional DriverName string // Allocation is set by the resource driver once a resource has been - // allocated successfully. Nil indicates that the resource is not - // allocated. + // allocated successfully. If this is not specified, the resource is + // not yet allocated. 
+ // +optional Allocation *AllocationResult - // DeallocationRequested gets set by the scheduler when it detects - // the situation where pod scheduling cannot proceed because some - // claim was allocated for a node that cannot provide some other - // required resource. - // - // The driver then needs to deallocate this claim and the scheduler - // will try again. - // - // While DeallocationRequested is set, no new users may be added - // to ReservedFor. - DeallocationRequested bool - // ReservedFor indicates which entities are currently allowed to use - // the resource. Usually those are Pods, but other objects are - // also possible as long as they exist. The resource controller will - // remove all entries for objects that do not exist. + // the claim. A Pod which references a ResourceClaim which is not + // reserved for that Pod will not be started. // - // A scheduler must add a Pod that it is scheduling. This must be done - // in an atomic ResourceClaim update because there might be multiple - // schedulers working on different Pods that compete for access to the - // same ResourceClaim, the ResourceClaim might have been marked - // for deletion, or even been deallocated already. + // There can be at most 32 such reservations. This may get increased in + // the future, but not reduced. + // +optional + ReservedFor []ResourceClaimConsumerReference + + // DeallocationRequested indicates that a ResourceClaim is to be + // deallocated. // - // kubelet will check this before allowing a Pod to run because a - // a user might have selected a node manually without reserving - // resources or a - // scheduler might have missed that step, for example because it - // doesn't support dynamic resource allocation or the feature was - // disabled. + // The driver then must deallocate this claim and reset the field + // together with clearing the Allocation field. // - // The maximum size is 32 (= [ResourceClaimReservedForMaxSize]). - // This is an artificial limit to prevent - // a completely unbounded field in the API. - ReservedFor []ResourceClaimUserReference - - <<[UNRESOLVED pohly]>> - We will have to discuss use cases and real resource drivers that - support sharing before deciding on a) which limit is useful and - b) whether we need a different API that supports an unlimited - number of users. - - Any solution that handles reservations differently will have to - be very careful about race conditions. - <<[/UNRESOLVED]>> + // While DeallocationRequested is set, no new consumers may be added to + // ReservedFor. + // +optional + DeallocationRequested bool } -// ReservedForMaxSize is maximum number of entries in -// [ResourceClaimStatus.ReservedFor]. +// ReservedForMaxSize is the maximum number of entries in +// claim.status.reservedFor. const ResourceClaimReservedForMaxSize = 32 +``` +DeallocationRequested gets set by the scheduler when it detects +that pod scheduling cannot proceed because some +claim was allocated for a node for which some other pending claims +cannot be allocated because that node ran out of resources for those. + +``` // AllocationResult contains attributed of an allocated resource. type AllocationResult struct { // ResourceHandle contains arbitrary data returned by the driver after a @@ -1304,157 +1316,173 @@ type AllocationResult struct { // Kubernetes. Driver documentation may explain to users how to // interpret this data if needed. // - // Resource drivers can use this to store some data directly or - // cross-reference some other place where information is stored. 
- // This data is guaranteed to be available when a Pod is about - // to run on a node, in contrast to the ResourceClass which - // may have been deleted in the meantime. - // - // The maximum size of this field is 16KiB. + // The maximum size of this field is 16KiB. This may get + // increased in the future, but not reduced. + // +optional ResourceHandle string // This field will get set by the resource driver after it has // allocated the resource driver to inform the scheduler where it can // schedule Pods using the ResourceClaim. // - // Node-local resources can use the `kubernetes.io/hostname` label - // to select a specific node. - // - // Setting this field is optional. If nil, the resource is available + // Setting this field is optional. If null, the resource is available // everywhere. + // +optional AvailableOnNodes *core.NodeSelector - // SharedResource determines whether the resource supports more - // than one user at a time. - SharedResource bool + // Shareable determines whether the resource supports more + // than one consumer at a time. + // +optional + Shareable bool + + <<[UNRESOLVED pohly]>> + We will have to discuss use cases and real resource drivers that + support sharing before deciding on a) which limit is useful and + b) whether we need a different API that supports an unlimited + number of users. + + Any solution that handles reservations differently will have to + be very careful about race conditions. + <<[/UNRESOLVED]>> } -// PodScheduling objects get created by a scheduler when it handles -// a pod which uses one or more unallocated ResourceClaims with delayed -// allocation. +// ResourceHandleMaxSize is the maximum size of allocation.resourceHandle. +const ResourceHandleMaxSize = 16 * 1024 +``` + +Resource drivers can use the `ResourceHandle` to store some data directly or +cross-reference some other place where information is stored. +This data is guaranteed to be available when a Pod is about +to run on a node, in contrast to the ResourceClass which +may have been deleted in the meantime. It's also protected from +modification by a user, in contrast to an annotation. + +``` +// PodScheduling objects hold information that is needed to schedule +// a Pod with ResourceClaims that use "WaitForFirstConsumer" allocation +// mode. +// +// This is an alpha type and requires enabling the DynamicResourceAllocation +// feature gate. type PodScheduling struct { metav1.TypeMeta - - // The name must be the same as the corresponding Pod. - // That Pod must be listed as owner in OwnerReferences - // to ensure that the PodScheduling object gets deleted - // when no longer needed. Normally the scheduler will delete it. - // - // Drivers must ignore PodScheduling objects where the owning - // pod already got deleted because such objects are orphaned - // and will be removed soon. - // - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata + // Standard object metadata + // +optional metav1.ObjectMeta - // Spec is set and updated by the scheduler. + // Spec describes where resources for the Pod are needed. Spec PodSchedulingSpec - // Status is updated by resource drivers. + // Status describes where resources for the Pod can be allocated. Status PodSchedulingStatus } +``` -// PodSchedulingSpec contains the request for information about -// resources required by a pod and eventually communicates -// the decision of the scheduler to move ahead with pod scheduling -// for a specific node. 
-type PodSchedulingSpec { - // When allocation is delayed, the scheduler must set - // the node for which it wants the resource(s) to be allocated - // before the driver(s) start with allocation. - // - // The driver must ensure that the allocated resource - // is available on this node or update ResourceSchedulingStatus.UnsuitableNodes - // to indicate where allocation might succeed. - // - // When allocation succeeds, drivers should immediately add - // the pod to the ResourceClaimStatus.ReservedFor field - // together with setting ResourceClaimStatus.Allocated. This - // optimization may save scheduling attempts and roundtrips - // through the API server because the scheduler does not - // need to reserve the claim for the pod itself. - // - // The selected node may change over time, for example - // when the initial choice turns out to be unsuitable - // after all. Drivers must not reallocate for a different - // node when they see such a change because it would - // lead to race conditions. Instead, the scheduler - // will trigger deallocation of specific claims as - // needed through the ResourceClaimStatus.DeallocationRequested - // field. +PodScheduling objects get created by a scheduler when it processes +a pod which uses one or more unallocated ResourceClaims with delayed +allocation. + +The name of a PodScheduling object must be the same as the corresponding Pod. +That Pod must be listed as owner in OwnerReferences +to ensure that the PodScheduling object gets deleted +when no longer needed. Normally the scheduler will delete it. + +Drivers must ignore PodScheduling objects where the owning +pod already got deleted because such objects are orphaned +and will be removed soon. + +``` +// PodSchedulingSpec describes where resources for the Pod are needed. +type PodSchedulingSpec struct { + // SelectedNode is the node for which allocation of ResourceClaims that + // are referenced by the Pod and that use "WaitForFirstConsumer" + // allocation is to be attempted. SelectedNode string - // When allocation is delayed, and the scheduler needs to - // decide on which node a Pod should run, it will - // ask the driver(s) on which nodes the resource might be - // made available. To trigger that check, the scheduler - // provides the names of nodes which might be suitable - // for the Pod. Will be updated periodically until - // all resources are allocated. - // - // The ResourceClass.SuiteableNodes node selector can be - // used to filter out nodes based on labels. This prevents - // adding nodes here that the driver then would need to - // reject through UnsuitableNodes. + // PotentialNodes lists nodes where the Pod might be able to run. // - // The size of this field is limited to 256 (= - // [PodSchedulingNodeListMaxSize]). This is large enough for many - // clusters. Larger clusters may need more attempts to find a node that - // suits all pending resources. + // The size of this field is limited to 128. This is large enough for + // many clusters. Larger clusters may need more attempts to find a node + // that suits all pending resources. This may get increased in the + // future, but not reduced. + // +optional PotentialNodes []string } +``` + +When allocation is delayed, the scheduler must set +the `SelectedNode` for which it wants the resource(s) to be allocated +before the driver(s) start with allocation. +The scheduler also needs to decide on which node a Pod should run and will +ask the driver(s) on which nodes the resource might be +made available. 
To trigger that check, the scheduler +provides the names of nodes which might be suitable +for the Pod and will update that list periodically until +all resources are allocated. + +The driver must ensure that the allocated resource +is available on this node or update ResourceClaimSchedulingStatus.UnsuitableNodes +to indicate where allocation might succeed. + +When allocation succeeds, drivers should immediately add +the pod to the ResourceClaimStatus.ReservedFor field +together with setting ResourceClaimStatus.Allocation. This +optimization may save scheduling attempts and roundtrips +through the API server because the scheduler does not +need to reserve the claim for the pod itself. + +The selected node may change over time, for example +when the initial choice turns out to be unsuitable +after all. Drivers must not reallocate for a different +node when they see such a change because it would +lead to race conditions. Instead, the scheduler +will trigger deallocation of specific claims as +needed through the ResourceClaimStatus.DeallocationRequested +field. + +The ResourceClass.SuitableNodes node selector can be +used to filter out nodes based on labels. This prevents +adding nodes here that the driver then would need to +reject through UnsuitableNodes. -// PodSchedulingStatus is where resource drivers provide -// information about where they could allocate a resource -// and whether allocation failed. +``` +// PodSchedulingStatus describes where resources for the Pod can be allocated. type PodSchedulingStatus struct { - // Each resource driver is responsible for providing information about - // those resources in the Pod that the driver manages. It can skip - // adding that information when it already allocated the resource. - // - // A driver must add entries here for all its pending claims, even if - // the ResourceSchedulingStatus.UnsuitabeNodes field is empty, - // because the scheduler may decide to wait with selecting - // a node until it has information from all drivers. - // - // +listType=map - // +listMapKey=podResourceClaimName + // ResourceClaims describes resource availability for each + // pod.spec.resourceClaim entry where the corresponding ResourceClaim + // uses "WaitForFirstConsumer" allocation mode. // +optional - Claims []ResourceClaimSchedulingStatus + ResourceClaims []ResourceClaimSchedulingStatus // If there ever is a need to support other kinds of resources // than ResourceClaim, then new fields could get added here // for those other resources. } +``` + +Each resource driver is responsible for providing information about +those resources in the Pod that the driver manages. It can skip +adding this information once it has allocated the resource. + +A driver must add entries here for all its pending claims, even if +the ResourceClaimSchedulingStatus.UnsuitableNodes field is empty, +because the scheduler may decide to wait with selecting +a node until it has information from all drivers. -// ResourceClaimSchedulingStatus contains information about one -// particular claim while scheduling a pod. +``` +// ResourceClaimSchedulingStatus contains information about one particular +// ResourceClaim with "WaitForFirstConsumer" allocation mode. type ResourceClaimSchedulingStatus struct { - // PodResourceClaimName matches the PodResourceClaim.Name field. - PodResourceClaimName string - - // UnsuitableNodes lists nodes that the claim cannot be allocated for. - // Nodes listed here will be ignored by the scheduler when selecting a - // node for a Pod. 
All other nodes are potential candidates, either - // because no information is available yet or because allocation might - // succeed. - // - // A change of the PodSchedulingSpec.PotentialNodes field and/or a failed - // allocation attempt trigger an update of this field: the driver - // then checks all nodes listed in PotentialNodes and UnsuitableNodes - // and updates UnsuitableNodes. - // - // It must include the prior UnsuitableNodes in this check because the - // scheduler will not list those again in PotentialNodes but they might - // still be unsuitable. - // - // This can change, so the driver also must refresh this information - // periodically and/or after changing resource allocation for some - // other ResourceClaim until a node gets selected by the scheduler. + // Name matches the pod.spec.resourceClaims[*].Name field. + Name string + + // UnsuitableNodes lists nodes that the ResourceClaim cannot be + // allocated for. // - // The size of this field is limited to 256 (= - // [PodSchedulingNodeListMaxSize]), the same as for - // PodSchedulingSpec.PotentialNodes. + // The size of this field is limited to 128, the same as for + // PodSchedulingSpec.PotentialNodes. This may get increased in the + // future, but not reduced. + // +optional UnsuitableNodes []string } @@ -1462,147 +1490,101 @@ type ResourceClaimSchedulingStatus struct { // node lists that are stored in PodScheduling objects. This limit is part // of the API. const PodSchedulingNodeListMaxSize = 256 +``` -type PodSpec { - ... - // ResourceClaims defines which ResourceClaims must be allocated - // and reserved before the Pod is allowed to start. The resources - // will be made available to those containers which reference them - // by name. - // - // At most 32 entries are allowed. - ResourceClaims []PodResourceClaim - ... -} - -type ResourceRequirements { - Limits ResourceList - Requests ResourceList - ... - // The entries are the names of resources in PodSpec.ResourceClaims - // that are used by the container. - Claims []string +UnsuitableNodes lists nodes that the claim cannot be allocated for. +Nodes listed here will be ignored by the scheduler when selecting a +node for a Pod. All other nodes are potential candidates, either +because no information is available yet or because allocation might +succeed. - <<[UNRESOLVED]>> - If we need per-container parameters, then we will have to make - this a struct instead of a string and change the logic in kubelet - from "prepare one CDI file for all containers" to "prepare one CDI file - per container". This might imply changing how CDI information is - passed into runtimes (avoid writing files?). +A change to the PodSchedulingSpec.PotentialNodes field and/or a failed +allocation attempt triggers an update of this field: the driver +then checks all nodes listed in PotentialNodes and UnsuitableNodes +and updates UnsuitableNodes. - The current approach is simpler, so we will need to consider - use cases carefully before changing the design. - <<[/UNRESOLVED]>> +It must include the prior UnsuitableNodes in this check because the +scheduler will not list those again in PotentialNodes but they might +still be unsuitable. - ... -} +This can change, so the driver must also refresh this information +periodically and/or after changing resource allocation for some +other ResourceClaim until a node gets selected by the scheduler. 
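+
+Putting this together, a PodScheduling object for a hypothetical Pod could
+look like the following sketch. The node and claim names are made up and only
+meant to illustrate the flow; the spec gets written by the scheduler, the
+status entry by the resource driver:
+
+```
+apiVersion: resource.k8s.io/v1alpha1
+kind: PodScheduling
+metadata:
+  # Same name and namespace as the Pod, with the Pod as owner.
+  name: my-pod
+  namespace: default
+spec:
+  selectedNode: worker-1
+  potentialNodes:
+  - worker-1
+  - worker-2
+  - worker-3
+status:
+  resourceClaims:
+  - name: gpu # matches one entry in pod.spec.resourceClaims
+    unsuitableNodes:
+    - worker-3
+```
+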
-// PodResourceClaim references exactly one ResourceClaim, either by name or -// by embedding a template for a ResourceClaim that will get created -// by the resource claim controller in kube-controller-manager. -type PodResourceClaim struct { - // A name under which this resource can be referenced by the containers. - Name string - - // Claim determines where to find the claim. - Claim ClaimSource -} - -// ClaimSource either references one separate ResourceClaim by name or -// embeds a template for a ResourceClaim, but never both. -// -// Additional options might get added in the future, so code using this -// struct must error out when none of the options that it supports are set. -ClaimSource struct { - // The resource is independent of the Pod and defined by - // a separate ResourceClaim in the same namespace as - // the Pod. Either this or Template must be set, but not both. - ResourceClaimName *string +``` +// ResourceClaimTemplate is used to produce ResourceClaim objects. +type ResourceClaimTemplate struct { + metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta - // Will be used to create a stand-alone ResourceClaim to allocate the resource. - // The pod in which this PodResource is embedded will be the - // owner of the ResourceClaim, i.e. the ResourceClaim will be deleted together with the - // pod. The name of the ResourceClaim will be `-` where - // `` is the name PodResource.Name - // Pod validation will reject the pod if the concatenated name - // is not valid for a ResourceClaim (for example, too long). + // Describes the ResourceClaim that is to be generated. // - // An existing ResourceClaim with that name that is not owned by the pod - // will *not* be used for the pod to avoid using an unrelated - // resource by mistake. Scheduling is then blocked until - // the unrelated ResourceClaim is removed. If such a pre-created ResourceClaim is - // meant to be used by the pod, the ResourceClaim has to be updated with an - // owner reference to the pod once the pod exists. Normally - // this should not be necessary, but it may be useful when - // manually reconstructing a broken cluster. - // - // Running the pod also gets blocked by a wrong ownership. This should - // be even less likely because of the prior scheduling check, but could - // happen if a user force-deletes or modifies the ResourceClaim. - // - // This field is read-only and no changes will be made by Kubernetes - // to the ResourceClaim after it has been created. - // Either this or ResourceClaimName must be set, but not both. - Template *ResourceClaimTemplate + // This field is immutable. A ResourceClaim will get created by the + // control plane for a Pod when needed and then not get updated + // anymore. + Spec ResourceClaimTemplateSpec } -// ResourceClaimTemplate is used to produce ResourceClaim objects by embedding -// such a template in the ResourceRequirements of a Pod. -type ResourceClaimTemplate struct { - // May contain labels and annotations that will be copied into the PVC +// ResourceClaimTemplateSpec contains the metadata and fields for a ResourceClaim. +type ResourceClaimTemplateSpec struct { + // ObjectMeta may contain labels and annotations that will be copied into the PVC // when creating it. No other fields are allowed and will be rejected during // validation. - // // +optional metav1.ObjectMeta - // The specification for the ResourceClaim. The entire content is - // copied unchanged into the PVC that gets created from this - // template. 
The same fields as in a ResourceClaim - // are also valid here. + // Spec for the ResourceClaim. The entire content is copied unchanged + // into the ResourceClaim that gets created from this template. The + // same fields as in a ResourceClaim are also valid here. Spec ResourceClaimSpec } -// ResourceClassParametersReference contains enough information to let you locate the -// parameters for a ResourceClass. +// ResourceClassParametersReference contains enough information to let you +// locate the parameters for a ResourceClass. type ResourceClassParametersReference struct { - // APIGroup is the group for the resource being referenced. - // If APIGroup is empty, the specified Kind must be in the core API group. - // For any other third-party types, APIGroup is required. + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. // +optional APIGroup string - // Kind is the type of resource being referenced + // Kind is the type of resource being referenced. This is the same + // value as in the parameter object's metadata. Kind string - // Name is the name of resource being referenced + // Name is the name of resource being referenced. Name string // Namespace that contains the referenced resource. Must be empty // for cluster-scoped resources and non-empty for namespaced // resources. + // +optional Namespace string } -// ResourceClaimParametersReference contains enough information to let you locate the -// parameters for a ResourceClaim. The object must be in the same namespace -// as the ResourceClaim. +// ResourceClaimParametersReference contains enough information to let you +// locate the parameters for a ResourceClaim. The object must be in the same +// namespace as the ResourceClaim. type ResourceClaimParametersReference struct { - // APIGroup is the group for the resource being referenced. - // If APIGroup is empty, the specified Kind must be in the core API group. - // For any other third-party types, APIGroup is required. + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. // +optional APIGroup string - // Kind is the type of resource being referenced + // Kind is the type of resource being referenced. This is the same + // value as in the parameter object's metadata, for example "ConfigMap". Kind string - // Name is the name of resource being referenced + // Name is the name of resource being referenced. Name string } -// ResourceClaimParametersReference contains enough information to let you -// locate the user of a ResourceClaim. The user must be a resource in the same +// ResourceClaimConsumerReference contains enough information to let you +// locate the consumer of a ResourceClaim. The user must be a resource in the same // namespace as the ResourceClaim. -type ResourceClaimUserReference struct { - // APIGroup is the API group for the resource being referenced. - // If Group is empty, the specified Kind must be in the core API group. - // For any other third-party types, APIGroup is required. +type ResourceClaimConsumerReference struct { + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. 
+ // +optional APIGroup string // Resource is the type of resource being referenced, for example "pods". Resource string @@ -1613,9 +1595,111 @@ type ResourceClaimUserReference struct { } ``` +`ResourceClassParametersReference` and `ResourceClaimParametersReference` use +the more user-friendly "kind" to identify the object type because those +references are provided by users. `ResourceClaimConsumerReference` is typically +set by the control plane and therefore uses the more technically correct +"resource" name. + +#### core + +``` +type PodSpec { + ... + // ResourceClaims defines which ResourceClaims must be allocated + // and reserved before the Pod is allowed to start. The resources + // will be made available to those containers which consume them + // by name. + // + // This is an alpha field and requires enabling the + // DynamicResourceAllocation feature gate. + // + // This field is immutable. + // + // +featureGate=DynamicResourceAllocation + // +optional + ResourceClaims []PodResourceClaim + ... +} + +type ResourceRequirements { + Limits ResourceList + Requests ResourceList + ... + // Claims lists the names of resources, defined in spec.resourceClaims, + // that are used by this container. + // + // This is an alpha field and requires enabling the + // DynamicResourceAllocation feature gate. + // + // This field is immutable. + // + // +featureGate=DynamicResourceAllocation + // +optional + Claims []ResourceClaim +} + +// ResourceClaim references one entry in PodSpec.ResourceClaims. +type ResourceClaim struct { + // Name must match the name of one entry in pod.spec.resourceClaims of + // the Pod where this field is used. It makes that resource available + // inside a container. + Name string +} +``` + +`Claims` is a list of structs with a single `Name` element because that struct +can be extended later, for example to add parameters that influence how the +resource is made available to a container. This wouldn't be possible if +it was a list of strings. + +``` +// PodResourceClaim references exactly one ResourceClaim through a ClaimSource. +// It adds a name to it that uniquely identifies the ResourceClaim inside the Pod. +// Containers that need access to the ResourceClaim reference it with this name. +type PodResourceClaim struct { + // Name uniquely identifies this resource claim inside the pod. + // This must be a DNS_LABEL. + Name string + + // Source describes where to find the ResourceClaim. + Source ClaimSource +} + +// ClaimSource describes a reference to a ResourceClaim. +// +// Exactly one of these fields should be set. Consumers of this type must +// treat an empty object as if it has an unknown value. +type ClaimSource struct { + // ResourceClaimName is the name of a ResourceClaim object in the same + // namespace as this pod. + ResourceClaimName *string + + // ResourceClaimTemplateName is the name of a ResourceClaimTemplate + // object in the same namespace as this pod. + // + // The template will be used to create a new ResourceClaim, which will + // be bound to this pod. When this pod is deleted, the ResourceClaim + // will also be deleted. The name of the ResourceClaim will be -, where is the + // PodResourceClaim.Name. Pod validation will reject the pod if the + // concatenated name is not valid for a ResourceClaim (e.g. too long). + // + // An existing ResourceClaim with that name that is not owned by the + // pod will not be used for the pod to avoid using an unrelated + // resource by mistake. 
Scheduling and pod startup are then blocked + // until the unrelated ResourceClaim is removed. + // + // This field is immutable and no changes will be made to the + // corresponding ResourceClaim by the control plane after creating the + // ResourceClaim. + ResourceClaimTemplateName *string +} +``` + ### kube-controller-manager -The code that creates a ResourceClaim from an inline ResourceClaimTemplate will +The code that creates a ResourceClaim from a ResourceClaimTemplate will be an almost verbatim copy of the [generic ephemeral volume code](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral), just with different types. @@ -2177,7 +2261,7 @@ For beta: - Tests are in Testgrid and linked in KEP - At least one scalability test for a likely scenario (for example, several pods each using different claims that get created from - inline templates) + templates) - Documentation for users and resource driver developers published - In addition to the basic features, we also handle: - reuse of network-attached resources after unexpected node shutdown @@ -2258,7 +2342,7 @@ when the feature is disabled. Workloads not using ResourceClaims should not be impacted because the new code will not do anything besides checking the Pod for ResourceClaims. -When kube-controller-manager fails to create ResourceClaims from inline +When kube-controller-manager fails to create ResourceClaims from ResourceClaimTemplates, those Pods will not get scheduled. Bugs in kube-scheduler might lead to not scheduling Pods that could run or worse, schedule Pods that should not run. Those then will get stuck on a node where @@ -2463,7 +2547,7 @@ Why should this KEP _not_ be implemented? ### ResourceClaimTemplate -Instead of creating a ResourceClaim from an embedded template, the +Instead of creating a ResourceClaim from a template, the PodStatus could be extended to hold the same information as a ResourceClaimStatus. Every component which works with that information then needs permission and extra code to work with PodStatus. Creating