Add Resource Class proposal
Signed-off-by: vikaschoudhary16 <[email protected]>
vikaschoudhary16 committed Jul 11, 2017
1 parent 3ef47e3 commit 53c2a80
Showing 1 changed file with 363 additions and 0 deletions.
363 changes: 363 additions & 0 deletions contributors/design-proposals/resource-class.md
@@ -0,0 +1,363 @@
# Resource Classes Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Resource Class](#resource-class)
7. [API Changes](#api-changes)
8. [Scheduler Changes](#scheduler-changes)
9. [Kubelet Changes](#kubelet-changes)
10. [Opaque Integer Resources](#opaque-integer-resources)
11. [Future Scope](#future-scope)

_Authors:_

* @vikaschoudhary16 - Vikas Choudhary &lt;[email protected]&gt;
* @aveshagarwal - Avesh Agarwal &lt;[email protected]&gt;

## Abstract
This document describes *resource classes*, a new model for representing
compute resources in Kubernetes. It should be read as a successor to the
[device plugin proposal](https://github.com/kubernetes/community/pull/695/files),
on which it depends.

## Motivation
Kubernetes natively knows only two resource types, 'CPU' and 'Memory'. A pod
can request any other resource through the opaque integer resource (OIR)
mechanism. An OIR is a key-value pair whose key is a string and whose value is
a 'Quantity', which can (optionally) be fractional. The current model works
well for simple compute resources like CPU or Memory, which are available
across all Kubernetes deployments.

Representing resources like GPUs, ASICs or local storage as OIRs is
problematic, however. Each such resource type generally has rich metadata, such
as version and capabilities, that describes the resource. A particular
application pod may perform well only with resources that have a certain
capability, for example GPUs newer than version 'V'. OIR does not allow such
metadata-based selection or filtering of resources.

The current model also requires an identity mapping between available resources
and requested resources. Identity mapping means that there must be a one-to-one
mapping between the resource reference used in the spec to request the resource
and the resource reference used to advertise the resource availability.

Since 'CPU' and 'Memory' are available across all Kubernetes deployments and
need no metadata to describe the resource type any further, the current
user-facing API (the Pod Specification) remains portable as long as a pod
requests only CPU and Memory. However, the current model cannot support
complex resources like GPUs, ASICs, NICs or local storage.
To support heterogeneity, portability and management at scale, such resources
must be represented (advertised) in a form that allows metadata to be included,
together with a metadata-based resource selection mechanism.

_GPU Integration Example:_
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join [email protected] for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want to have a compute resource type, 'Resource Class', which can be
  created with meaningful and portable names. This compute resource can hold
  additional metadata as well, for example:
  * `nvidia.gpu.high.mem` is the name and the metadata is memory greater than
    'X' GB.
  * `fast.nic` is the name and the associated metadata is bandwidth greater
    than 'B' gbps.
* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
  type device which has memory greater than or equal to 'X' GB should be able
  to satisfy this request, independent of other device capabilities such as
  'version' or 'nvlink locality'.
* Similarly, if I request a resource `fast.nic`, any NIC device with speed
  greater than 'B' gbps should be able to meet the request.
* I want a rich metadata selection interface where the operators 'Eq'
  (equal to), 'Lt' (less than), 'LtEq' (less than or equal to), 'Gt'
  (greater than), 'GtEq' (greater than or equal to) and 'In' (one of a set of
  accepted values) are supported on the compute resource metadata.

## Objectives

1. Define and add support in the API for a new type, *Resource Class*.
2. Add support for *Resource Class* in the scheduler.

## Non Objectives
1. Discovery, advertisement, allocation and deallocation of devices are
expected to be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).

## Resource Class
*Resource Class* is a new type whose objects provide an abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices through its `resourceSelector`, a list
of selectors that each carry `matchExpressions`, a list of
(key, operator, values) requirements. A device is selected if at least one entry
of `resourceSelector` matches the device details. Within a single
`matchExpressions` list, all the requirements are ANDed together to evaluate the
result.

A *Resource Class* is a non-namespaced object and is immutable once created.

YAML example 1:
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nvidia-gpu"
    - key: "memory"
      operator: "GtEq"
      values:
      - "30G"
```
The resource class above selects all nvidia-gpus with memory greater than or
equal to 30 GB.

YAML example 2:
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "huge-pages"
    - key: "size"
      operator: "GtEq"
      values:
      - "1G"
```
The resource class above selects all hugepages with size greater than or equal
to 1 GB.

YAML example 3:
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nic"
    - key: "speed"
      operator: "GtEq"
      values:
      - "40GBPS"
```
The resource class above selects all NICs with speed greater than or equal to
40 GBPS.
## API Changes
### ResourceClass
Internal representation of *Resource Class*:
```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// Spec defines resources required
	Spec ResourceClassSpec
	// +optional
	Status ResourceClassStatus
}

// Spec defines resources required
type ResourceClassSpec struct {
	// ResourceSelector selects resources. A resource is selected if at
	// least one entry of the list matches it.
	ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
	// A list of resource/device selector requirements. The requirements
	// are ANDed together.
	MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values
type ResourceSelectorRequirement struct {
	// The label key that the selector applies to
	// +patchMergeKey=key
	// +patchStrategy=merge
	Key string
	// The values that the key is compared against
	// +optional
	Values []string
	// The operator that relates the key to the values
	Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
	ResourceSelectorOpIn           ResourceSelectorOperator = "In"
	ResourceSelectorOpEq           ResourceSelectorOperator = "Eq"
	ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
	ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
	ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
	ResourceSelectorOpGtEq         ResourceSelectorOperator = "GtEq"
	ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
	ResourceSelectorOpLtEq         ResourceSelectorOperator = "LtEq"
)
```
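The selection semantics of these types can be illustrated with a minimal, self-contained sketch. It is not the proposed implementation: the `Requirement` and `Selector` types are simplified copies of the structs above, device attributes are assumed to be a flat `map[string]string`, and `parseAmount` is a hypothetical helper that only understands plain integers with an optional `G` suffix (so a value such as `40GBPS` would not parse). How `NotIn` treats a missing key is also an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Requirement mirrors ResourceSelectorRequirement from the proposal.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// Selector mirrors ResourcePropertySelector; its requirements are ANDed.
type Selector struct {
	MatchExpressions []Requirement
}

// parseAmount is a hypothetical helper: it drops an optional "G" suffix and
// parses the remainder as an integer. It is not a real quantity parser.
func parseAmount(s string) (int64, error) {
	return strconv.ParseInt(strings.TrimSuffix(s, "G"), 10, 64)
}

// contains reports whether values includes v.
func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matchRequirement evaluates one requirement against a device's attributes.
func matchRequirement(attrs map[string]string, r Requirement) bool {
	got, present := attrs[r.Key]
	switch r.Operator {
	case "Exists":
		return present
	case "DoesNotExist":
		return !present
	case "In":
		return present && contains(r.Values, got)
	case "NotIn":
		// Assumption: a device without the key satisfies NotIn.
		return !present || !contains(r.Values, got)
	case "Eq":
		return present && len(r.Values) == 1 && r.Values[0] == got
	case "Gt", "GtEq", "Lt", "LtEq":
		if !present || len(r.Values) != 1 {
			return false
		}
		have, errHave := parseAmount(got)
		want, errWant := parseAmount(r.Values[0])
		if errHave != nil || errWant != nil {
			return false
		}
		switch r.Operator {
		case "Gt":
			return have > want
		case "GtEq":
			return have >= want
		case "Lt":
			return have < want
		default: // "LtEq"
			return have <= want
		}
	}
	return false // unknown operator
}

// Matches ORs the selectors; each selector ANDs its requirements.
func Matches(attrs map[string]string, selectors []Selector) bool {
	for _, s := range selectors {
		matched := len(s.MatchExpressions) > 0 // empty selector matches nothing
		for _, r := range s.MatchExpressions {
			if !matchRequirement(attrs, r) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	highMem := []Selector{{MatchExpressions: []Requirement{
		{Key: "Kind", Operator: "In", Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: "GtEq", Values: []string{"30G"}},
	}}}
	fmt.Println(Matches(map[string]string{"Kind": "nvidia-gpu", "memory": "32G"}, highMem)) // true
	fmt.Println(Matches(map[string]string{"Kind": "nvidia-gpu", "memory": "16G"}, highMem)) // false
}
```

A production version would more likely parse values with a real quantity type instead of the integer shortcut used here.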
### ResourceClassStatus
```golang
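// resource refers to k8s.io/apimachinery/pkg/api/resource.
type ResourceClassStatus struct {
	// Allocatable is the quantity of devices that the class can select.
	Allocatable resource.Quantity
	// Requested is the quantity currently requested by pods.
	Requested resource.Quantity
}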
```
The scheduler updates the ResourceClass status on:
1. New *Resource Class* object creation.
2. Node addition to the cluster.
3. Node removal from the cluster.
4. Pod creation, if the pod requests a resource class.
5. Pod deletion, if the pod was consuming a resource class.

`ResourceClassStatus` serves the following two purposes:
* The scheduler evaluates its predicates against it during pod creation. For
details, please refer to the following sections.
* Users can view the current usage and availability details of the resource
class using kubectl.

### User story
The administrator has deployed device plugins to support hardware present in the
cluster. Device plugins, running on nodes, will update node status indicating
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
about the devices as selection criteria.

1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters out the nodes which do not match the resource requests.
3. The scheduler selects a device for each requested resource class and
annotates the pod object with the device selection info, e.g.
`scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4`,
where `scheduler.alpha.kubernetes.io/resClass` is the common prefix for all the
device annotations, `test-res-class` is the resource class name,
`nvidia-tesla-gpu` is the selected device name and `4` is the quantity requested.

4. Kubelet reads the device request from the pod annotation and calls `Allocate`
on the matching Device Plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet reads the pod object annotation for the devices consumed and calls
`Deallocate` on the matching Device Plugins.
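
As a rough illustration of step 3, the sketch below builds such an annotation on the scheduler side. The function name `DeviceAnnotation` is hypothetical, and the sketch assumes that the part before `=` in the example is the annotation key and the part after it is the annotation value.

```go
package main

import (
	"fmt"
	"strconv"
)

// annotationPrefix is the common prefix named in the proposal.
const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// DeviceAnnotation returns the annotation key and value recording that
// quantity units of device were selected for the resource class.
// Assumption: the key is prefix_class_device and the value is the quantity.
func DeviceAnnotation(class, device string, quantity int64) (key, value string) {
	key = annotationPrefix + "_" + class + "_" + device
	return key, strconv.FormatInt(quantity, 10)
}

func main() {
	key, value := DeviceAnnotation("test-res-class", "nvidia-tesla-gpu", 4)
	fmt.Printf("%s=%s\n", key, value)
	// prints scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4
}
```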

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for preferring device selection at the scheduler and not at the kubelet
Kubelet does not maintain any cache. Therefore, to learn whether a device
requested by a new incoming pod is available, kubelet calculates how many
devices are consumed by the already admitted pods by iterating over all the
admitted pods running on the node. This is done while running predicates for
each new incoming pod at the kubelet. Even if we assume that the scheduler cache
and the consumption state created at runtime for each pod are exactly the same,
the current API interfaces do not allow the selected device to be passed to the
container manager (from where the device plugin will actually be invoked). This
problem occurs because requested resource classes are translated into devices
internally, in code, and the user does not mention a device in the pod object,
whereas other resource requests can be determined from the pod object directly.
To summarize, device selection at the kubelet can be done in one of the
following two ways:
* Select the device at pod admission while applying predicates, and change all
the API interfaces that are required to pass the selected device to the
container manager.
* Recreate the resource consumption state at the container manager and select
the device there.

Neither of these approaches seems cleaner than doing device selection at the
scheduler, which helps to retain cleaner API interfaces between packages.

## Scheduler Changes
The scheduler already watches node and pod objects and maintains their state in
its cache. We will enhance this logic:
1. To watch user-created *Resource Class* objects and maintain their state in the cache.
2. To look for device-related details in node objects and maintain accounting for
devices as well.

From the events perspective, handling for the following events will be added or updated:

### Resource Class Creation
1. Initialize the resource class info and add it to the local cache.
2. Iterate over all existing nodes in the cache to find devices on these nodes
which are selectable by the resource class. If any are found, update the
resource class availability status in the local cache.
3. Patch the status of the resource class API object with the availability state
from the local cache.
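
Steps 1 and 2 above can be sketched as follows. This is only an illustration under simplifying assumptions: `Cache`, `ClassInfo`, `Device` and `AddClass` are hypothetical names, a plain function stands in for the evaluation of `matchExpressions`, each device counts as one unit, and the API patch of step 3 is left out.

```go
package main

import "fmt"

// Device is a simplified view of a device advertised in node status.
type Device struct {
	Kind     string
	MemoryGB int64
}

// ClassInfo is the scheduler's cached state for one resource class.
type ClassInfo struct {
	Selects     func(Device) bool // stand-in for evaluating matchExpressions
	Allocatable int64
}

// Cache holds the nodes and resource classes known to the scheduler.
type Cache struct {
	Nodes   map[string][]Device // node name -> devices on that node
	Classes map[string]*ClassInfo
}

// AddClass covers steps 1 and 2: it registers the class, then counts every
// device on every cached node that the class selects.
func (c *Cache) AddClass(name string, selects func(Device) bool) *ClassInfo {
	info := &ClassInfo{Selects: selects}
	c.Classes[name] = info
	for _, devices := range c.Nodes {
		for _, d := range devices {
			if selects(d) {
				info.Allocatable++
			}
		}
	}
	return info // step 3 (patching the API object) is omitted here
}

func main() {
	cache := &Cache{
		Nodes: map[string][]Device{
			"node-a": {{Kind: "nvidia-gpu", MemoryGB: 32}, {Kind: "nvidia-gpu", MemoryGB: 16}},
			"node-b": {{Kind: "nvidia-gpu", MemoryGB: 48}},
		},
		Classes: map[string]*ClassInfo{},
	}
	info := cache.AddClass("nvidia.high.mem", func(d Device) bool {
		return d.Kind == "nvidia-gpu" && d.MemoryGB >= 30
	})
	fmt.Println(info.Allocatable) // 2
}
```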

### Resource Class Deletion
Delete the resource class info from the cache.

### Node Addition
The scheduler already caches `NodeInfo`. In addition, it will now update the
device state:
1. Check in the node status whether any devices are present.
2. For each device found, iterate over all existing resource classes in the cache
to find the resource classes which can select this particular device. For all
such resource classes, update the availability state in the local cache.
3. Patch the status of the ResourceClass API object, `ResourceClassStatus`,
according to the updated availability state in the cache.

### Node Deletion
If the node has devices which are selectable by existing resource classes:
1. Adjust the resource class state in the local cache.
2. Update the resource class status by patching the API object.
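
The node addition and node deletion handling above is symmetric, which the following sketch tries to show. All names (`Cache`, `ClassInfo`, `AddNode`, `RemoveNode`, `adjust`) are hypothetical, each device counts as one unit, and patching the API object is omitted. The sketch also assumes that a node is added only once and does not deal with pods that still hold devices on a removed node.

```go
package main

import "fmt"

// Device is a simplified device entry from node status.
type Device struct {
	Kind string
}

// ClassInfo is the cached state of one resource class.
type ClassInfo struct {
	Selects     func(Device) bool // stand-in for matchExpressions evaluation
	Allocatable int64
}

// Cache is the scheduler-side view of nodes and resource classes.
type Cache struct {
	Nodes   map[string][]Device
	Classes map[string]*ClassInfo
}

// adjust adds delta to Allocatable of every class that selects a device.
func (c *Cache) adjust(devices []Device, delta int64) {
	for _, d := range devices {
		for _, class := range c.Classes {
			if class.Selects(d) {
				class.Allocatable += delta
			}
		}
	}
}

// AddNode records the node's devices and raises class availability.
func (c *Cache) AddNode(name string, devices []Device) {
	c.Nodes[name] = devices
	c.adjust(devices, 1)
}

// RemoveNode lowers class availability and forgets the node.
func (c *Cache) RemoveNode(name string) {
	c.adjust(c.Nodes[name], -1)
	delete(c.Nodes, name)
}

func main() {
	gpu := &ClassInfo{Selects: func(d Device) bool { return d.Kind == "nvidia-gpu" }}
	cache := &Cache{
		Nodes:   map[string][]Device{},
		Classes: map[string]*ClassInfo{"gpu": gpu},
	}
	cache.AddNode("node-a", []Device{{Kind: "nvidia-gpu"}, {Kind: "nvidia-gpu"}, {Kind: "nic"}})
	fmt.Println(gpu.Allocatable) // 2
	cache.RemoveNode("node-a")
	fmt.Println(gpu.Allocatable) // 0
}
```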

### Pod Creation
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates according to the requested quantity and
the resource class's state present in the cache.
3. On the selected node, select a device from the device info stored in the cache
by matching the keys and values of the requested resource class.
4. After device selection, update 'Requested', which reduces the remaining
availability, for all the resource classes in the cache which could select this
device.
5. Patch the resource class objects with the new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with the selected device annotation, prefixed with 'scheduler.alpha.kubernetes.io/resClass'.

NOTE: This proposal proposes only 'first fit' as the device selection strategy.
In the future, this can be extended to multiple algorithms that the user can
choose from in a configurable manner.
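
A first-fit selection covering steps 3, 4 and 6 could look roughly like the sketch below. It is an illustration, not the proposed code: `Cache`, `Class`, `Device` and `Allocate` are hypothetical names, the node is assumed to have been chosen already, a function stands in for selector evaluation, and the sketch assumes that 'Requested' counts units currently held by pods, so it grows on allocation.

```go
package main

import "fmt"

// Device is a device on the node already chosen by the predicates.
type Device struct {
	Name string
	Kind string
	Free int64 // units not yet handed to any pod
}

// Class is the cached state of a resource class.
type Class struct {
	Name      string
	Selects   func(*Device) bool // stand-in for matchExpressions evaluation
	Requested int64              // assumption: units currently held by pods
}

// Cache is a simplified scheduler cache for a single node.
type Cache struct {
	Devices     []*Device
	Classes     []*Class
	DeviceToPod map[string][]string // device name -> pods using it
}

// Allocate picks the first device that the class selects and that has
// enough free units (first fit), then updates the bookkeeping.
func (c *Cache) Allocate(pod string, class *Class, quantity int64) (*Device, bool) {
	for _, d := range c.Devices {
		if !class.Selects(d) || d.Free < quantity {
			continue
		}
		d.Free -= quantity
		// Every class that could select this device sees the consumption.
		for _, other := range c.Classes {
			if other.Selects(d) {
				other.Requested += quantity
			}
		}
		c.DeviceToPod[d.Name] = append(c.DeviceToPod[d.Name], pod)
		return d, true
	}
	return nil, false
}

func main() {
	anyGPU := &Class{Name: "any.gpu", Selects: func(d *Device) bool { return d.Kind == "nvidia-gpu" }}
	cache := &Cache{
		Devices: []*Device{
			{Name: "nic0", Kind: "nic", Free: 1},
			{Name: "gpu0", Kind: "nvidia-gpu", Free: 2},
			{Name: "gpu1", Kind: "nvidia-gpu", Free: 4},
		},
		Classes:     []*Class{anyGPU},
		DeviceToPod: map[string][]string{},
	}
	if d, ok := cache.Allocate("pod-a", anyGPU, 4); ok {
		fmt.Println(d.Name, anyGPU.Requested) // gpu1 4
	}
}
```

First fit simply takes the first candidate in iteration order. A different strategy would only need to change how the candidate device is chosen inside `Allocate`.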

### Pod Delete
1. Iterate over all the devices on the node on which the pod was scheduled and
find the devices being used by the pod.
2. For each device consumed by the pod, update in the cache the availability
state of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.
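
The pod deletion steps can be sketched in the same simplified style. `ReleasePod`, the `UsedBy` map and the other names are hypothetical, 'Requested' is again assumed to count units currently held by pods, and patching `ResourceClassStatus` is omitted.

```go
package main

import "fmt"

// Device is a device on the node where the pod was running.
type Device struct {
	Name   string
	Kind   string
	Free   int64
	UsedBy map[string]int64 // pod name -> units consumed on this device
}

// Class is the cached state of a resource class.
type Class struct {
	Selects   func(*Device) bool // stand-in for matchExpressions evaluation
	Requested int64              // assumption: units currently held by pods
}

// ReleasePod returns the units a pod held on each device of its node and
// lowers Requested for every class that can select those devices.
func ReleasePod(pod string, nodeDevices []*Device, classes []*Class) int64 {
	var released int64
	for _, d := range nodeDevices {
		quantity, used := d.UsedBy[pod]
		if !used {
			continue
		}
		delete(d.UsedBy, pod)
		d.Free += quantity
		released += quantity
		for _, class := range classes {
			if class.Selects(d) {
				class.Requested -= quantity
			}
		}
	}
	return released
}

func main() {
	gpu0 := &Device{
		Name:   "gpu0",
		Kind:   "nvidia-gpu",
		Free:   1,
		UsedBy: map[string]int64{"pod-a": 2, "pod-b": 1},
	}
	anyGPU := &Class{
		Selects:   func(d *Device) bool { return d.Kind == "nvidia-gpu" },
		Requested: 3,
	}
	released := ReleasePod("pod-a", []*Device{gpu0}, []*Class{anyGPU})
	fmt.Println(released, gpu0.Free, anyGPU.Requested) // 2 3 1
}
```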

## Kubelet Changes
Update the logic in the container runtime manager to look for device
annotations prefixed by 'scheduler.alpha.kubernetes.io/resClass' and to call the
matching device plugins.
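
On the kubelet side, reading these annotations might look like the sketch below. It is illustrative only: `DeviceRequest` and `DeviceRequests` are hypothetical names, the key layout `prefix_class_device` with the quantity as the annotation value is an assumption based on the example in the user story, and the parser assumes that resource class names contain no underscore.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// annotationPrefix is the common prefix from the proposal, followed by the
// underscore that separates it from the resource class name.
const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass_"

// DeviceRequest is one device selection made by the scheduler.
type DeviceRequest struct {
	Class    string
	Device   string
	Quantity int64
}

// DeviceRequests extracts the device selections from pod annotations and
// ignores annotations that do not carry the prefix.
// Assumption: class names contain no underscore.
func DeviceRequests(annotations map[string]string) ([]DeviceRequest, error) {
	var requests []DeviceRequest
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		parts := strings.SplitN(strings.TrimPrefix(key, annotationPrefix), "_", 2)
		if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
			return nil, fmt.Errorf("malformed device annotation %q", key)
		}
		quantity, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad quantity in annotation %q: %v", key, err)
		}
		requests = append(requests, DeviceRequest{Class: parts[0], Device: parts[1], Quantity: quantity})
	}
	return requests, nil
}

func main() {
	requests, err := DeviceRequests(map[string]string{
		"scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu": "4",
		"example.com/unrelated": "ignored",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", requests[0])
	// prints {Class:test-res-class Device:nvidia-tesla-gpu Quantity:4}
}
```

Each returned request would then be handed to the device plugin that owns the named device.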

## Opaque Integer Resources
This API will supersede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIRs will need to adapt and update node status with devices
instead.


## Future Scope
* RBAC: How to tie resource classes to RBAC, as is done for other existing API
resource objects, can be explored further.
* Nested Resource Classes: In the future, device plugins and resource classes
can be extended to support nested resource classes, where one resource class is
composed of a group of sub-resource classes. For example, a 'numa-node'
resource class could be composed of 'single-core' sub-resource classes.
* Multiple device selection algorithms, each with a different selection strategy,
will be added to the scheduler, and the cluster admin will be able to configure
the one they prefer.
