RFC: Add instance type settings design proposal #2390
Conversation
LGTM. Just one nit
designs/instance-type-settings.md
Outdated
3. In the meantime, while most users may not be using either mechanism to achieve system-wide pod overheads, we will still calculate the `kubeReserved` values based on pod density. There are still some open questions about whether we should open up the ability to specify the strategy type (see below)
3. Max Pods
1. In general, it appears to be an anti-pattern in the Kubernetes community to strictly fix maxPods values (for kubeletConfiguration) on a per-instance-type level. Users should be restricting the number of pods on their instances based on pod requests and the size of instances, as opposed to strict pod density sizing. If users are concerned with Provisioners creating instances that are too large, there is now the option to specify GT or LT operators to limit how large a provisioned instance type can get.
2. 🔑 For users that continue to utilize and/or need `--max-pods`, we will surface a `pods` value in the `.spec.allocatable` of the `InstanceType`. This `pods` value will be used for pod density for bin-packing and scheduling. Since there would now be a `max-pods` value surfaced at both the `InstanceType` level and the `Provisioner` level, we will take `min(provisioner.maxPods, instanceType.maxPods)` to be the pod density for the instance.
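To make the combination rule above concrete, a small sketch with made-up numbers (not a real manifest, just the two inputs and the result):

# effective pod density = min(provisioner.maxPods, instanceType.pods)
provisioner:
  kubeletConfiguration:
    maxPods: 110
instanceType:
  allocatable:
    pods: "29"   # min(110, 29) = 29 is used for bin-packing and scheduling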
IIUC, this is the only field of the instancetype CRD that is capable of influencing user data. I'm a bit wary of this coupling, how it might limit our design decisions in the future, as well as the security implications. Put differently, InstanceType is currently a readonly type and you're proposing to give it write privileges. It currently isn't capable of modifying the configuration of nodes, which is a special privilege of the Provisioner. I'm comfortable using it for binpacking, since this relationship exists today.
Here's a less controversial alternative that functions identically with how `type InstanceType interface` works today, and doesn't sacrifice any use cases (see the sketch after this list):
- The `InstanceType` CRD can expose `pods` as a value in `Allocatable`
- The `maxpods` value can be hardcoded in the provisioner's kubelet configuration
- Our binpacker will default to what the `InstanceType` says (which may be overridden by the CRD), but if the provisioner sets a value, that's what we'll use.
- If the customer wants to compute a dynamic (w.r.t. instance type) max pods value, they can do so in user data, and they just need to make sure that this function agrees with the values they've overridden in the `InstanceType` CRD.
- If you need different dynamic max pods values for the same instance type but different provisioners, stop and rethink your architecture.
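A rough sketch of how that split could look. The `InstanceType` apiVersion, kind, and field names below are hypothetical (that CRD did not exist at the time of this RFC), and the numbers are illustrative only:

apiVersion: karpenter.k8s.aws/v1alpha1   # hypothetical API group/version
kind: InstanceType
metadata:
  name: c5.large
spec:
  allocatable:
    pods: "29"   # read by the binpacker as the default pod density
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    maxPods: 20   # if set, this value wins over the InstanceType default for nodes from this provisioner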
> as well as the security implications

I'm not quite sure I get the security implications that you are trying to call out here. Maybe you could describe it further. I see that this changes some read/write assumptions about the `InstanceType` and how it interacts with the current code logic, but I'm not quite sure how this has security implications for the project.
> Here's a less controversial alternative that functions

I think the other option to call out is that we can surface `podsPerCore` in the `kubeletConfiguration` in the `Provisioner`, which would give users another method to have this be a dynamic value without the overhead of having to write custom `userData`. If we are going to say that `pods` in the `.spec.capacity` can't affect `--max-pods`, I would rather just punt on having `pods` included entirely.
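For illustration, surfacing `podsPerCore` on the provisioner might look roughly like this (a sketch only; at this point the field was a suggestion, not a released API):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    podsPerCore: 10   # pod density scales with the instance's vCPU count, no custom userData required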
designs/instance-type-settings.md
Outdated
**🔑 Open Questions**

1. There are still [user asks](https://github.com/aws/karpenter/issues/1803) for calculating the kubeReserved resources value based on the GKE calculation, as opposed to the EKS default which is based on pod density. Should we allow some flag on the provisioner that can determine which calculation to base the value on? i.e. `kubeReservedMemoryStrategy`
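If such a flag were introduced, one possible shape on the provisioner (the field name and values are made up purely for illustration):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    kubeReservedMemoryStrategy: gke   # hypothetical flag; the alternative would be the EKS pod-density-based default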
I'd love for us to do the work and figure out containerd overhead in 2022, and then provide customers with an accurate runtime class.
🚀 🎸 🎸 🚀
This reads really well. I just have a few small suggestions.
designs/instance-type-settings.md
Outdated
3. If we were to deny overlapping `InstanceTypeSettings`, then the feature of specifying a group of instance types may not be used that often, as configuration between instance types may rarely be shared
4. It’s unclear whether users would have enough common configuration that they would assign it across instance types. It makes more sense to iterate on a grouping mechanism later if there is a user ask for it

**Recommendation:** Use the `InstanceType` CRD for the initial iteration with 1-1 mappings before considering introducing more complex constructs to the user. While specifying rulesets might reduce the overhead burden for customers of keeping track of separate `InstanceTypeSettings`, without an API surface to view exactly what settings a ruleset will produce, it will ultimately be confusing for users.
For our use case (similar to #1900, where nodes offer fuse devices via the smarter device manager) the recommended approach of having a 1-1 mapping would create quite a significant overhead and burden: for every instance type and size, we would need to create an `InstanceType` resource and configure it. Even with templating this is quite a bit of effort and seems prone to errors. Also, we would need to keep the list of available instance types up to date, increasing the maintenance work. So going with the second option sounds like a more sustainable approach to me personally.
> recommended approach of having a 1-1 mapping would create quite a significant overhead and burden

How would you get around having this burden? My assumption is that each instance type wouldn't have a static number of resources that match `smarter-devices/fuse`, so you would need some dynamic way to configure the quantities for each of the different instance types, or at least a collection of instance types.

This becomes a problem when extending out to other resources. Even assuming that there is a consistent quantity of resources for the `smarter-devices/fuse` resource across all instance types, that certainly wouldn't be true for other extended resources.

> Also, we would need to keep the list of available instance types up to date

This works as an override mechanism, so you would still have those instance types available if they didn't have a matching CR for them. The only difference there would be that the extended resource wouldn't be mapped onto it, which would still be something that would need to be addressed with the other "collection-based" approach.
Also, the other thing that we were thinking with the approach proposed in this design is that we would offer some tooling/guidance around how to programmatically generate these values into CRs, so that the maintenance burden is reduced.
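For illustration, a single generated 1:1 override might look roughly like this (a hypothetical schema following the RFC's `InstanceType` CRD proposal; the group/version and fields are not a released API):

apiVersion: karpenter.k8s.aws/v1alpha1   # hypothetical API group/version
kind: InstanceType
metadata:
  name: m5.xlarge
spec:
  capacity:
    smarter-devices/fuse: "1"   # extended resource advertised by the device plugin on this instance type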
Summarizing most of the feedback that we've heard: it seems like we might want to consider a mechanism for "selecting" sets of instance types so that we can ease the customer burden of having to maintain a 1:1 mapping. An alternative solution here may be to keep the 1:1 mapping and provide some documentation around manifest generation for these. Though this might be more confusing to reason about, in general it will cause less overhead for users.
Any chance to see this in one of the coming releases?
@jonathan-innis @ellistarn We would really like to move out of the fork of #2161 for a significant number of clusters. We are forced to maintain it, as otherwise this would be a blocking no-go for Karpenter.
@universam1 We've talked a bit about this internally. I think we're willing to take a short-term experimental path that would unblock you in the main project. Ideally, the hope is that we can get a more robust solution that solves all of the problems in the RFC going forward, but right now we haven't put this at super high priority. Here are the thoughts:
aws.instanceTypeOverrides: "[{"name": "c5.large", "capacity": {"fuse": "1", "other-extended-resource": "2"}}]"
Perhaps we can model it based off of the instance type struct?
Thank you @jonathan-innis for considering this request and taking into account several options to solve it. Very much appreciated! We are certainly happy to contribute to a solution; if you can provide some more details on where you would see it fit in and what you'd specifically like to see, that would help get us started. One requirement from our side, though, is to be agnostic to instance type and size. It is impossible for us to predefine all variants we would like to support (type x size).
Can you explain why this is the case? If you want all types of the size
I'm not really a fan of opening up this settings schema to regex-based pattern matching. It doesn't seem particularly explicit in its design and implementation, and seems overly complicated given what we are trying to implement.
The instance types we permit the devs to request, or allow Karpenter to choose, sum up to roughly 600 types and versions: say, anything with >1 core and >3GB of memory, including amd64 or arm64. Having to manage that means updating a huge list and/or keeping hundreds of developers informed about which instances are whitelisted and why not.
Could you please explain why this is complicated to implement? I believe a few lines of code would suffice. In contrast, coding an API or generator for Karpenter definitions is a huge effort to implement.
Alternatively, the type and size could be optional fields; then the match would be explicitly undefined.
I'm less a fan of the regex solution and might be swayed toward doing a requirements-based solution that allows you to create disparate objects that contain requirements (of which instance-type name can be one) and capacity details with a merge semantic. Would this suit the case you are describing and be more maintainable for you?
[
  {
    "requirements":[
      {
        "key":"karpenter.k8s.aws/instance-size",
        "operator":"In",
        "values":["large"]
      }
    ],
    "capacity":{
      "fuse":"1",
      "other-extended-resource":"4"
    }
  }
]
In the ConfigMap, it would look like:
"aws.instanceTypeOverrides": "[{"requirements":[{"key":"karpenter.k8s.aws/instance-size","operator":"In","values":["large"]}],"capacity":{"fuse":"1","other-extended-resource":"4"}}]"
That's a great one! I think the list of instance sizes is not that much of an issue to manage. In fact, I could imagine a hack to invert to a "NotIn: small" if we really want to permit all sizes but those irrelevant ones. Thank you!
On this same train of thought, what if there was something like:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  resources:
    requests:
      fuse: "2" # We assume this much until the node registers
      some.example.com/extended-resource: "500Mi" # We assume this much until the node registers
I like this semantic actually the most, a single authoritative place for Karpenter config! 👏🏻
AFAIR Karpenter does not actively assign pods to nodes any more like it used to, leaving that to kube-scheduler? Great, we'll only have to ensure that the
Indeed, the main part of our specs contains:
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small
Great idea!
We handle this with the concept of "node initialization". We wait for the node to go ready and have all requested resources registered before we consider pending pods for scheduling to new nodes, so you shouldn't have to handle this cool-down period yourself. Once we are aware that a resource should appear on the node, we wait until we see it before we consider it empty or not eligible to schedule pods.
Thanks for this RFC. I wonder what the current status is?
@haarchri We're keeping this open to keep the conversation going, since this is the largest item currently tracking and creating pointers to all the open issues. There's been a lot of discussion among the maintainers and contributors around how we should attack a lot of these disparate problems, so the RFC is still unsettled and probably won't be until the project settles and merges a lot of the other current feature work.
As mentioned here, there is still a lot of uncertainty about the plan of attack for how to handle custom instance type settings. The issues mentioned in this initial RFC will be addressed at a later time, with a rewritten RFC. Closing this for now.
@ellistarn @jonathan-innis Trying to get a feeling for the timeline to expect for your long-term solution. We are wondering if we should maintain a fork with an intermediate solution for the time being, or if there are other options possible?