Skip to content

Commit

Permalink
docs: RFC for NodeOverlay
Browse files Browse the repository at this point in the history
  • Loading branch information
ellistarn committed Jun 20, 2024
1 parent 24c9761 commit 4573cb8
Show file tree
Hide file tree
Showing 6 changed files with 558 additions and 0 deletions.
186 changes: 186 additions & 0 deletions designs/node-overlays.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
# Node Overlays

Node launch decisions are the output a scheduling simulation algorithm that involves satisfying pod scheduling constraints with available instance type offerings. Cloud provider implementations are responsible for surfacing these offerings to the Karpenter scheduler through the [cloudprovider.GetInstanceTypes()](https://github.com/kubernetes-sigs/karpenter/blob/37db06f4742eada19934f76e616f2b1860aca57b/pkg/cloudprovider/types.go#L59) API. This separation of responsibilities enables cloud providers to make simplifying assumptions, but limits extensibility for more advanced use cases. NodeOverlays enable customers to inject alternative assumptions into the scheduling simulation for more accurate scheduling simulations.

## Use Cases

### Price

Each instance offering comes with a price, which is used as a minimization function in Karpenter's scheduling algorithm. However, Karpenter cannot perfectly understand all factors that influence price, such as an external vendor's fee, a pricing discount, or an external motiviation such as carbon-offset.

* https://github.com/aws/karpenter-provider-aws/pull/4686
* https://github.com/aws/karpenter-provider-aws/issues/3860
* https://github.com/aws/karpenter-provider-aws/pull/4697

### Extended Resources

InstanceTypes have a set of well known resource capacities, such as CPU, memory, or GPUs. However, Kubernetes supports arbitrary [extended resources](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/), which cannot be known prior to Node launch. Karpenter lacks a mechanism to teach the simulation about extended resource capacities.

* https://github.com/kubernetes-sigs/karpenter/issues/751
* https://github.com/kubernetes-sigs/karpenter/issues/729

### Resource Overhead

* https://github.com/aws/karpenter-provider-aws/issues/5161

System software such as containerd, kubelet, and the host operating system come with resource overhead like CPU and memory, which is substracted from a Node's capacity. Karpenter attempts to calculate this overhead where it is known, but cannot predict the overhead required by custom operating systems. To complicate this, the overhead often changes for each version of an operating system.

## Proposal

NodeOverlays are a new Karpenter API that enable customers to fine-tune the scheduling simulation for advanced use cases. NodeOverlays can be configured with a `metav1.Selector` which flexibly constrain when they are applied to a simulation. Selectors can match well-known labels like `node.kubernetes.io/instance-type` or `karpenter.sh/node-pool` and custom labels defined in a NodePool's `.spec.labels` or `.spec.requirements`. When multiple NodeOverlays match an offering for a given simulation, they are merged together using `spec.weight` to resolve conflicts.

### Examples

Configure price to match a compute savings plan
```yaml
kind: NodeOverlay
metadata:
name: default-discount
spec:
pricePercent: 90
```
Support extended resource types (e.g. smarter-devices/fuse)
* https://github.com/kubernetes-sigs/karpenter/issues/751
* https://github.com/kubernetes-sigs/karpenter/issues/729
```yaml
kind: NodeOverlay
metadata:
name: extended-resource
spec:
selector:
matchLabels:
karpenter.sh/capacity-type: on-demand
matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.12xlarge"]
capacity:
smarter-devices/fuse: 1
```
Add memory overhead of of 10Mi and 50Mi for all instances with more than 2Gi memory
* https://github.com/aws/karpenter-provider-aws/issues/5161
```yaml
kind: NodeOverlay
metadata:
name: memory-default
spec:
overhead:
memory: 10Mi
---
kind: NodeOverlay
metadata:
name: memory
spec:
weight: 1
selector:
matchExpressions:
key: karpenter.k8s.aws/instance-memory
operator: Gt
values: [2048]
overhead:
memory: 50Mi
---
```

### Selectors

NodeOverlay's selector semantic allows for fine grained control over when values are injected into the scheduling simulation. An empty selector applies to all simulations, or 1:n with InstanceType. Selectors could be written that define a 1:1 relationship between NodeOverlay and InstanceType. Selectors can even be written to apply n:1 with InstanceType, through additional labels that constrain by NodePool, NodeClass, zone, or other scheduling dimensions.

A [previous proposal](https://github.com/aws/karpenter-provider-aws/pull/2390/files#diff-b7121b2d822e57b70203794f3b6367e7173b16d84070c3712cd2f15b8dbd11b2R133) explored the tradeoffs between a 1:1 semantic vs more flexible alternatives. It raised valid observability challenges around communicating to customers exactly how the CR was modifying the scheduling simulation. It recommended a 1:1 approach, with the assumption that some system would likely dynamically codegenerate the CRs for the relevant use case. We received [direct feedback](https://github.com/aws/karpenter-provider-aws/pull/2404#issuecomment-1250688882) from some customers that the number of variations would be combinatorial, and thus unmanagable for their use case. Ultimately, we paused this analysis in favor of other priorities.

After two years of additional customer feedback in this problem space, this proposal asserts that the flexibility of a selector-based approach is a signficant simplification for customers compared to a 1:1 approach and avoids the scalability challenges associated with persisting large numbers of CRs in the Kubernetes API. This decision will require us to work through the associated observability challenges.

### Merging

Our decision to pursue Selectors that allow for an n:m semantic presents a challenge where multiple NodeOverlays could match a simulation, but with different values. To address this, we will support an optional `.spec.weight` parameter (discussed in the [previous proposal](https://github.com/aws/karpenter-provider-aws/pull/2390/files#diff-b7121b2d822e57b70203794f3b6367e7173b16d84070c3712cd2f15b8dbd11b2R153)).

We have a few choices for dealing with conflicting NodeOverlays

1. Multiple NodeOverlays are merged in order of weight
2. Multiple NodeOverlays are applied sequentially in order of weight
3. One NodeOverlay is selected by weight

We recommend option 1, as it supports the common use case of default + override semantic. In the memory overhead example, both NodeOverlays attempt to define the same parameter, but one is more specific than the other. Option 1 and 3 allow us to support this use case, where option 2 would result in 50Mi+10Mi. We are not aware of use cases that would benefit from sequential application.

We choose option 1 over option 3 to support NodeOverlays that define multiple values. For example, we could collapse the example NodeOverlays using a merge semantic.

```yaml
kind: NodeOverlay
metadata:
name: default
spec:
pricePercent: 90
overhead:
memory: 10Mi
---
kind: NodeOverlay
metadata:
name: memory
spec:
weight: 1 # required to resolve conflicts with default
selector:
matchExpressions:
key: karpenter.k8s.aws/instance-memory
operator: Gt
values: [2048]
overhead:
memory: 50Mi
---
kind: NodeOverlay
metadata:
name: extended-resource
spec:
# weight is not required since default does not conflict with extended resources
selector:
matchLabels:
karpenter.sh/capacity-type: on-demand
matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.12xlarge"]
capacity:
smarter-devices/fuse: 1
```
Option 1 enables us to be strictly more flexible than option 3. However, it introduces ambiguity when multiple NodeOverlays have the same weight. Option 3 would only allow one to apply, so any conflict must be a misconfiguration. However, our example demonstrates that Option 1 has valid use cases where multiple NodeOverlays share the same weight, as long as they don't define the same value. For Option 1, NodeOverlays with the same weight that both attempt to configure the same field are considered to be a misconfiguration.
### Misconfiguration
Given our merge semantics, it's possible for customers to define multiple NodeOverlays with undefined behavior. In both cases, we will identify and communicate these misconfigurations, but should we fail open or fail closed? Failing closed means that a misconfigured would halt Karpenter from launching new Nodes until the misconfiguration is resolved. Failing open means that Karpenter might launch capacity that differs from what an operator expects, and result in thrashing scale in/outs.
We choose to fail closed to prevent undefined behavior that could incur costs. However, we have the option to fail closed for the entire provisioning loop, or for a single NodePool. For example, if multiple NodeOverlays match NodePoolA with the same weight, but only a single NodeOverlay matches NodePoolB, Karpenter could exclude NodePoolA as invalid, but proceed with NodePoolB for simulation. These semantics are very similar to how Karpenter deals with `.spec.limits` or `InsufficientCapacityErrors`. See: https://karpenter.sh/docs/concepts/scheduling/#fallback.

### Observability

How can users be aware of how their NodeOverlays have applied to their NodePools? Following from our decision above, we will model the application of NodeOverlays at the NodePool level. This leads us to observe the behavior of NodeOverlays in the status of the NodePool.

For example:
```
kind: NodePool
status:
conditions:
- type: InstanceTypesReady
status: True
nodeOverlays:
- default
- extended-resources
instanceTypes:
- m5.large
- ...
```

Exposing the full details of the mutations caused by NodeOverlays cannot be feasibly stored in ETCD. Further, we do not want to require that customers read the Karpenter logs to debug this feature. When GetInstanceTypes is called per NodePool, we will emit an event on the NodePool to surface deeper semantic information about the effects of the NodeOverlays. Our implementation will need to be conservative on the volume and frequency of information, but provide sufficient insight for customers to verify that their configurations are working as expected.

### Per-Field Semantics

The semantics of each field carry individual nuance. Price could be modeled directly via a `price` field, but we can significantly reduce the number of NodeOverlays that need to be defined if we instead model the semantic as a function of the currently defined price (e.g. `pricePercent` or `priceAdjustment`). Similarly, `capacity` and `overhead` make sense to be modeled directly, but `capacityPercent` makes much less sense than `overheadPercent`.

We expect these semantics to be specific to each field, and do not attempt to define or support a domain specific language (e.g. CEL) for arbitrary control over these fields. We expect this API to evolve as we learn about the specific use cases.

### Launch Scope

To accelerate feedback on this new mechanism, we will release this API as v1alpha1, with support for a single field: `pricePercent`. All additional scope described in this doc will be left to followon RFCs and implementation effort.
85 changes: 85 additions & 0 deletions kwok/charts/crds/karpenter.sh_nodeoverlays.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.15.0
name: nodeoverlays.karpenter.sh
spec:
group: karpenter.sh
names:
categories:
- karpenter
kind: NodeOverlay
listKind: NodeOverlayList
plural: nodeoverlays
singular: nodeoverlay
scope: Cluster
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
properties:
apiVersion:
description: |-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type: string
kind:
description: |-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type: string
metadata:
type: object
spec:
properties:
pricePercent:
description: PricePercent modifies the price of the simulated node
(PriceAdjustment + (Price * PricePercent / 100)).
type: integer
selector:
description: Selector matches against simulated nodes and modifies
their scheduling properties. Matches all if empty.
items:
description: |-
A node selector requirement is a selector that contains values, a key, and an operator
that relates the key and values.
properties:
key:
description: The label key that the selector applies to.
type: string
operator:
description: |-
Represents a key's relationship to a set of values.
Valid operators are In, NotIn, Exists, DoesNotExist. Gt, and Lt.
type: string
values:
description: |-
An array of string values. If the operator is In or NotIn,
the values array must be non-empty. If the operator is Exists or DoesNotExist,
the values array must be empty. If the operator is Gt or Lt, the values
array must have a single element, which will be interpreted as an integer.
This array is replaced during a strategic merge patch.
items:
type: string
type: array
x-kubernetes-list-type: atomic
required:
- key
- operator
type: object
type: array
type: object
required:
- spec
type: object
served: true
storage: true
subresources:
status: {}
85 changes: 85 additions & 0 deletions pkg/apis/crds/karpenter.sh_nodeoverlays.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.15.0
name: nodeoverlays.karpenter.sh
spec:
group: karpenter.sh
names:
categories:
- karpenter
kind: NodeOverlay
listKind: NodeOverlayList
plural: nodeoverlays
singular: nodeoverlay
scope: Cluster
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
properties:
apiVersion:
description: |-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type: string
kind:
description: |-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type: string
metadata:
type: object
spec:
properties:
pricePercent:
description: PricePercent modifies the price of the simulated node
(PriceAdjustment + (Price * PricePercent / 100)).
type: integer
selector:
description: Selector matches against simulated nodes and modifies
their scheduling properties. Matches all if empty.
items:
description: |-
A node selector requirement is a selector that contains values, a key, and an operator
that relates the key and values.
properties:
key:
description: The label key that the selector applies to.
type: string
operator:
description: |-
Represents a key's relationship to a set of values.
Valid operators are In, NotIn, Exists, DoesNotExist. Gt, and Lt.
type: string
values:
description: |-
An array of string values. If the operator is In or NotIn,
the values array must be non-empty. If the operator is Exists or DoesNotExist,
the values array must be empty. If the operator is Gt or Lt, the values
array must have a single element, which will be interpreted as an integer.
This array is replaced during a strategic merge patch.
items:
type: string
type: array
x-kubernetes-list-type: atomic
required:
- key
- operator
type: object
type: array
type: object
required:
- spec
type: object
served: true
storage: true
subresources:
status: {}
19 changes: 19 additions & 0 deletions pkg/apis/v1alpha1/doc.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
/*
Copyright The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

// +k8s:deepcopy-gen=package
// +groupName=karpenter.sh
package v1alpha1 // doc.go is discovered by codegen
Loading

0 comments on commit 4573cb8

Please sign in to comment.