# Resource Classes Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Resource Class](#resource-class)
7. [API Changes](#api-changes)
8. [Scheduler Changes](#scheduler-changes)
9. [Kubelet Changes](#kubelet-changes)
10. [Opaque Integer Resources](#opaque-integer-resources)
11. [Future Scope](#future-scope)

_Authors:_

* @vikaschoudhary16 - Vikas Choudhary <[email protected]>
* @aveshagarwal - Avesh Agarwal <[email protected]>

## Abstract
This document describes *resource classes*, a new model for representing compute
resources in Kubernetes. It should be read as a successor to the
[device plugin proposal](https://github.com/kubernetes/community/pull/695/files)
and depends on it.

## Motivation
The Kubernetes system knows only two resource types, 'CPU' and 'Memory'. Any other
resource can be requested by a pod using the opaque-integer-resource (OIR) mechanism.
An OIR is a key-value pair with the key being a string and the value being a 'Quantity',
which can (optionally) be fractional. The current model is great for supporting
simple compute resources like CPU or Memory, which are available across all
Kubernetes deployments.
However, there is a problem in representing resources like GPUs, ASICs, local storage,
etc. in the form of OIRs. Each such resource type generally has rich
metadata, like version, capabilities, etc., describing the resource. A particular
application pod may perform well only with resources that have a certain
capability, for example GPUs with version greater than 'V'. OIR does not allow such
metadata-based selection/filtering of resources.
The current model requires identity mapping between available resources and requested
resources. Identity mapping means there must be a one-to-one mapping between the
resource reference used in the spec to request the resource and the resource reference
used to advertise the resource availability.

Since 'CPU' and 'Memory' are resources that are available across all Kubernetes
deployments and need no metadata to describe the resource type any further,
the current user-facing API (Pod Specification) remains portable as long as a pod
requests only CPU and Memory. However, the current model cannot support
complex resources like GPUs, ASICs, NICs, local storage, etc.
To support heterogeneity, portability and management at scale, such resources
must be represented (advertised) in a form which allows metadata inclusion and a
metadata-based resource selection mechanism.

_GPU Integration Examples:_
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join [email protected] for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want to have a compute resource type, 'Resource Class',
  which can be created with meaningful and portable names. This compute
  resource can hold additional metadata as well, for example:
  * `nvidia.gpu.high.mem` is the name and the metadata is memory greater than 'X' GB.
  * `fast.nic` is the name and the associated metadata is bandwidth greater than
    'B' gbps.
* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
  type device which has memory greater than or equal to 'X' GB should be able
  to satisfy this request, independent of other device capabilities such as
  'version' or 'nvlink locality'.
* Similarly, if I request a resource `fast.nic`, any NIC device with speed
  greater than 'B' gbps should be able to meet the request.
* I want a rich metadata selection interface where operators like 'Eq' for
  'equal to', 'Lt' for 'less than', 'LtEq' for 'less than or equal to', 'Gt' for
  'greater than', 'GtEq' for 'greater than or equal to' and 'In' for
  'a set of accepted values' are supported on the compute resource metadata.

## Objectives

1. Define and add support in the API for a new type, *Resource Class*.
2. Add support for *Resource Class* in the scheduler.

## Non Objectives
1. Discovery, advertisement and allocation/deallocation of devices are expected to
   be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).

## Resource Class
*Resource Class* is a new type, objects of which provide abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using a `resourceSelector`, a list of
selector terms, each of which carries `matchExpressions`, a list of
(key, operator, values) requirements. A *Resource Class* object selects a device
if at least one of the selector terms matches the device's details. Within a term,
all the (key, operator, values) requirements are ANDed together to evaluate the result.

A *Resource Class* object is non-namespaced, and once created the object is
immutable.

YAML example 1:
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nvidia-gpu"
        - key: "memory"
          operator: "GtEq"
          values:
            - "30G"
```
The above resource class selects all nvidia-gpu devices which have memory greater
than or equal to 30 GB.

YAML example 2:
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "huge-pages"
        - key: "size"
          operator: "GtEq"
          values:
            - "1G"
```
The above resource class selects all hugepages with size greater than or equal
to 1 GB.

YAML example 3:
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nic"
        - key: "speed"
          operator: "GtEq"
          values:
            - "40GBPS"
```
The above resource class selects all NICs with speed greater than or equal to
40 Gbps.

## API Changes
### ResourceClass
Internal representation of *Resource Class*:
```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// Spec defines the resources required
	Spec ResourceClassSpec
	// +optional
	Status ResourceClassStatus
}

// ResourceClassSpec defines the resources required
type ResourceClassSpec struct {
	// ResourceSelector selects resources
	ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
	// A list of resource/device selector requirements.
	// The requirements are ANDed together.
	MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains a key, an operator
// and values, where the operator relates the key and the values.
type ResourceSelectorRequirement struct {
	// The label key that the selector applies to
	// +patchMergeKey=key
	// +patchStrategy=merge
	Key string
	// +optional
	Values []string
	// Operator relating the key and the values
	Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
	ResourceSelectorOpIn           ResourceSelectorOperator = "In"
	ResourceSelectorOpEq           ResourceSelectorOperator = "Eq"
	ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
	ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
	ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
	ResourceSelectorOpGtEq         ResourceSelectorOperator = "GtEq"
	ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
	ResourceSelectorOpLtEq         ResourceSelectorOperator = "LtEq"
)
```
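
The sketch below illustrates the intended selection semantics (OR across selector
terms, AND across the requirements within one term). It is not part of the proposal:
the `device` representation, the property names, and the simplified integer
comparison used for `GtEq` are assumptions made only for illustration; a real
implementation would operate on the metadata advertised by device plugins and on
`resource.Quantity` values.
```golang
package main

import (
	"fmt"
	"strconv"
)

// device is a hypothetical, simplified view of an advertised device:
// every property is kept as a string (numeric properties hold plain integers).
type device map[string]string

// requirement mirrors ResourceSelectorRequirement in a simplified form.
type requirement struct {
	Key      string
	Operator string
	Values   []string
}

// termMatches: all requirements within one selector term must hold (AND).
func termMatches(reqs []requirement, dev device) bool {
	for _, r := range reqs {
		val, ok := dev[r.Key]
		if !ok || len(r.Values) == 0 {
			return false
		}
		switch r.Operator {
		case "In":
			found := false
			for _, v := range r.Values {
				if v == val {
					found = true
				}
			}
			if !found {
				return false
			}
		case "GtEq":
			have, err1 := strconv.ParseInt(val, 10, 64)
			want, err2 := strconv.ParseInt(r.Values[0], 10, 64)
			if err1 != nil || err2 != nil || have < want {
				return false
			}
		default:
			return false // other operators omitted in this sketch
		}
	}
	return true
}

// classSelects: the class selects the device if ANY selector term matches (OR).
func classSelects(terms [][]requirement, dev device) bool {
	for _, term := range terms {
		if termMatches(term, dev) {
			return true
		}
	}
	return false
}

func main() {
	gpu := device{"Kind": "nvidia-gpu", "memory": "32"} // memory in GB, simplified
	highMem := [][]requirement{{
		{Key: "Kind", Operator: "In", Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: "GtEq", Values: []string{"30"}},
	}}
	fmt.Println(classSelects(highMem, gpu)) // true
}
```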
### ResourceClassStatus
```golang
type ResourceClassStatus struct {
	// Total quantity of matching devices advertised in the cluster
	Allocatable resource.Quantity
	// Quantity currently requested by pods consuming this resource class
	Requested resource.Quantity
}
```
The ResourceClass status is updated by the scheduler on:
1. Creation of a new *Resource Class* object.
2. Node addition to the cluster.
3. Node removal from the cluster.
4. Pod creation, if the pod requests a resource class.
5. Pod deletion, if the pod was consuming a resource class.

`ResourceClassStatus` serves the following two purposes:
* Scheduler predicate evaluation during pod creation (see the sketch below). For
  details, please refer to the sections that follow.
* Users can view the current usage/availability details of the resource class
  using kubectl.
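
As a rough illustration of how a scheduler predicate might consult this status
(not part of the proposal; quantities are simplified to plain integers instead of
`resource.Quantity`, and the function name is made up for this sketch):
```golang
// fitsResourceClass reports whether a pod request for `want` units of a
// resource class can still be satisfied, given the class's recorded
// Allocatable and Requested totals.
func fitsResourceClass(allocatable, requested, want int64) bool {
	return allocatable-requested >= want
}
```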

### User story
The administrator has deployed device plugins to support hardware present in the
cluster. Device plugins, running on nodes, update the node status to indicate
the presence of this hardware. To offer this hardware to applications deployed
on Kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes include metadata
about the devices as selection criteria.

1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters out the nodes which do not match the resource requests.
3. The scheduler selects a device for each resource class requested and annotates
   the pod object with the device selection info, e.g.
   `scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4`,
   where `scheduler.alpha.kubernetes.io/resClass` is the common prefix for all
   device annotations, `test-res-class` is the resource class name,
   `nvidia-tesla-gpu` is the selected device name and `4` is the quantity requested.
4. Kubelet reads the device request from the pod annotation and calls `Allocate` on
   the matching device plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet reads the pod object annotation for devices consumed and calls `Deallocate`
   on the matching device plugins.

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for preferring device selection at the scheduler and not at the kubelet
The kubelet does not maintain any cache. Therefore, to know the availability of a
device requested by a new incoming pod, the kubelet calculates how many devices
are consumed by all already-admitted pods by iterating over all the admitted
pods running on the node. This is done while running predicates for each new
incoming pod at the kubelet. Even if we assume that the scheduler cache and the
consumption state created at runtime for each pod are exactly the same, the current
API interfaces do not allow passing the selected device to the container manager
(from where the device plugin will actually be invoked). This problem occurs because
requested resource classes are translated into devices internally in code and the
user does not mention a device in the pod object, whereas other resource requests
can be determined from the pod object directly.
To summarize, device selection at the kubelet can be done in one of the following
two ways:
* Select a device at pod admission while applying predicates, and change all API
  interfaces that are required to pass the selected device to the container manager.
* Recreate the resource consumption state at the container manager and select a device there.

Neither of the above approaches seems cleaner than doing device selection at the
scheduler, which helps to retain cleaner API interfaces between packages.

## Scheduler Changes
The scheduler already listens for changes in node and pod objects and maintains
the corresponding state in its cache. We will enhance this logic:
1. To listen for user-created *Resource Class* objects and maintain their state in the cache.
2. To look for device-related details in node objects and maintain accounting for
   devices as well.

From the events perspective, handling for the following events will be added/updated:

### Resource Class Creation
1. Initialize and add the resource class info into the local cache.
2. Iterate over all existing nodes in the cache to figure out whether there are devices
   on these nodes which are selectable by the resource class. If found, update the
   resource class availability state in the local cache (see the sketch below).
3. Patch the status of the resource class API object with the availability state in
   the local cache.
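
Continuing the simplified conventions of the earlier sketch (devices as string
property maps, integer quantities, a `classSelects`-style predicate), step 2 above
could look roughly like this; the cache and node structures are invented purely for
illustration:
```golang
// nodeDevices is a hypothetical cache entry: the devices advertised by one
// node, each with an allocatable quantity and its advertised properties.
type nodeDevices struct {
	Quantities map[string]int64             // device name -> allocatable quantity
	Properties map[string]map[string]string // device name -> device metadata
}

// classAllocatable sums, across all cached nodes, the quantity of devices
// that the given resource class selects.
func classAllocatable(nodes []nodeDevices, selects func(map[string]string) bool) int64 {
	var total int64
	for _, n := range nodes {
		for name, qty := range n.Quantities {
			if selects(n.Properties[name]) {
				total += qty
			}
		}
	}
	return total
}
```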

### Resource Class Deletion
Delete the resource class info from the cache.

### Node Addition
The scheduler already caches `NodeInfo`. Now it additionally updates device state:
1. Check in the node status whether any devices are present.
2. For each device found, iterate over all existing resource classes in the cache
   to find the resource classes which can select this particular device. For all
   such resource classes, update the availability state in the local cache.
3. The ResourceClass API object's status, `ResourceClassStatus`, will be patched
   as per the updated availability state in the cache.

### Node Deletion
If the node has devices which are selectable by existing resource classes:
1. Adjust the resource class state in the local cache.
2. Update the resource class status by patching the API object.

### Pod Creation
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates according to the requested quantity and the
   resource class's state present in the cache.
3. On the selected node, select a device from the stored device info in the cache
   by matching the key/value requirements of the requested resource class.
4. After device selection, update (increase) 'Requested' in the cache for all the
   resource classes which could select this device.
5. Patch the resource class objects with the new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with the selected device annotation with the prefix
   'scheduler.alpha.kubernetes.io/resClass'.

NOTE: This proposal proposes only 'first fit' as the device selection strategy
(a rough sketch follows below). In the future, this can be extended so that multiple
algorithms are available for the user to choose from, in a configurable manner.
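
A rough sketch of the 'first fit' selection mentioned above, again using the
simplified device representation from the earlier sketches (a predicate stands in
for resource class matching); none of these names come from actual scheduler code:
```golang
// cachedDevice is a hypothetical scheduler-cache entry for one device on a node.
type cachedDevice struct {
	Name       string
	Available  int64             // quantity not yet requested by pods
	Properties map[string]string // advertised device metadata
}

// firstFit returns the first device on the node that the resource class selects
// and that still has enough available quantity, or false if none fits.
func firstFit(devices []cachedDevice, selects func(map[string]string) bool, want int64) (cachedDevice, bool) {
	for _, d := range devices {
		if d.Available >= want && selects(d.Properties) {
			return d, true
		}
	}
	return cachedDevice{}, false
}
```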

### Pod Delete
1. Iterate over all the devices on the node to which the pod was scheduled and
   find the devices being used by the pod.
2. For each device consumed by the pod, update the availability state, in the cache,
   of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.

## Kubelet Changes
Update the logic at the container runtime manager to look for device annotations
prefixed by 'scheduler.alpha.kubernetes.io/resClass' and call the matching device
plugins.
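
A minimal sketch of how the kubelet might decode such an annotation into the
resource class name, device name and quantity, assuming the
`resClass_<class>_<device>=<quantity>` layout shown in the user story (the function
and its error handling are illustrative only, not part of the kubelet API):
```golang
import (
	"fmt"
	"strconv"
	"strings"
)

const resClassPrefix = "scheduler.alpha.kubernetes.io/resClass_"

// parseDeviceAnnotation splits an annotation key/value pair such as
// "scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu" = "4"
// into its resource class name, device name and requested quantity.
func parseDeviceAnnotation(key, value string) (class, device string, qty int64, err error) {
	rest := strings.TrimPrefix(key, resClassPrefix)
	parts := strings.SplitN(rest, "_", 2)
	if len(parts) != 2 {
		return "", "", 0, fmt.Errorf("unexpected annotation key %q", key)
	}
	qty, err = strconv.ParseInt(value, 10, 64)
	return parts[0], parts[1], qty, err
}
```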

## Opaque Integer Resources
This API will supersede [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIR will need to be adapted to update the node status with
devices instead.

## Future Scope
* RBAC: It can further be explored how to tie resource classes into RBAC, like any
  other existing API resource object.
* Nested Resource Classes: In the future, device plugins and resource classes can be
  extended to support nested resource class functionality, where one resource class
  is composed of a group of sub-resource classes. For example, a 'numa-node' resource
  class composed of 'single-core' sub-resource classes.
* Multiple device selection algorithms, each with a different selection strategy,
  will be added to the scheduler, and the cluster admin will be able to configure one
  as per his/her choice.