
RFC: Add instance type settings design proposal #2390

Closed

Conversation

jonathan-innis
Contributor

Fixes #

Description

  • Adds Instance Type settings design doc and RFC.

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

None

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Aug 29, 2022

Deploy Preview for karpenter-docs-prod canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | 821aa2a |
| 🔍 Latest deploy log | https://app.netlify.com/sites/karpenter-docs-prod/deploys/632b54b56022dd000a1dbf1a |

@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch 3 times, most recently from a584057 to 8e07c9d Compare August 29, 2022 17:21
@jonathan-innis
Contributor Author

#2129, #2180, #1490, #1803, #1995, and #2161 are addressed by this design. Requesting comments from issue owners and others who are interested in the design.

@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch 2 times, most recently from e86a4ae to 6b59714 Compare August 30, 2022 00:49
njtran
njtran previously approved these changes Aug 30, 2022
Contributor

@njtran njtran left a comment


LGTM. Just one nit

@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch 2 times, most recently from 249bf8f to 738cea0 Compare August 30, 2022 21:41
@jonathan-innis jonathan-innis marked this pull request as ready for review August 31, 2022 17:53
@jonathan-innis jonathan-innis requested a review from a team as a code owner August 31, 2022 17:53
@jonathan-innis jonathan-innis requested a review from dewjam August 31, 2022 17:53
@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch from 10dd135 to faf549e Compare August 31, 2022 17:54
3. In the meantime, while most users may not be using either mechanism to achieve system-wide pod overheads, we will still calculate the `kubeReserved` values based on pod density. There are still some open questions about whether we should open up the ability to specify the strategy type (see below)
3. Max Pods
1. In general, it appears to be an anti-pattern in the Kubernetes community to strictly fix maxPods values (for kubeletConfiguration) at a per-instance-type level. Users should restrict the number of pods on their instances based on pod requests and instance size rather than strict pod density sizing. If users are concerned about Provisioners creating instances that are too large, there is now the option to specify Gt or Lt operators to limit how large a provisioned instance type can get (see the sketch after this list).
2. 🔑 For users that continue to utilize and/or need `--max-pods`, we will surface a `pods` value in the `.spec.allocatable` of the `InstanceType`. This `pods` value will be used for pod density for bin-packing and scheduling. Since there would now be a `max-pods` value surfaced at both the `InstanceType` level and at the `Provisioner` level, we will take the `min(provisioner.maxPods, instanceType.maxPods)` to be the pod density for the instance.
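For illustration only (not part of the design text): bounding instance size with Gt/Lt requirements on Karpenter's well-known instance labels could look roughly like the following, where the specific keys and thresholds are example values.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["3"]        # only instance types with more than 3 vCPUs
    - key: karpenter.k8s.aws/instance-memory
      operator: Lt
      values: ["65536"]    # and less than ~64 GiB of memory (label value is in MiB)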
Contributor

IIUC, this is the only field of the instancetype CRD that is capable of influencing user data. I'm a bit wary of this coupling, how it might limit our design decisions in the future, as well as the security implications. Put differently, InstanceType is currently a readonly type and you're proposing to give it write privileges. It currently isn't capable of modifying the configuration of nodes, which is a special privilege of the Provisioner. I'm comfortable using it for binpacking, since this relationship exists today.

Here's a less controversial alternative that functions identically to how the InstanceType interface type works today, and doesn't sacrifice any use cases.

  1. The InstanceType CRD can expose pods as a value in Allocatable
  2. The maxPods value can be hardcoded in the provisioner's kubelet configuration (see the sketch after this list)
  3. Our binpacker will default to what the InstanceType says (which may be overridden by the CRD), but if the provisioner sets a value, that's what we'll use.
  4. If the customer wants to compute a dynamic (w.r.t. instancetype) max pods value, they can do so in user data, and they just need to make sure that this function agrees with the values they've overridden in the instancetype CRD.
  5. If you need different dynamic max pods values for the same instance type but different provisioners, stop and rethink your architecture.
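As a minimal sketch of items 2 and 3, assuming the maxPods override lands under the Provisioner's existing kubeletConfiguration block (the placement is part of the proposal here, not a settled API):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    maxPods: 110   # hard override; the binpacker would prefer this over the InstanceType's pods value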

Contributor Author


as well as the security implications
I'm not quite sure I get the security implications that you are trying to call out here. Maybe you could describe them further. I see that this changes some read/write assumptions about the InstanceType and how it interacts with the current code logic, but I'm not quite sure how this has security implications for the project.

Here's a less controversial alternative that functions
I think the other option to call out is that we can surface podsPerCore in the kubeletConfiguration in the Provisioner, which would give users another way to make this a dynamic value without the overhead of having to write custom userData. If we are going to say that pods in the `.spec.capacity` can't affect --max-pods, I would rather punt on including pods entirely.
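For reference, surfacing podsPerCore could look roughly like the sketch below; whether Karpenter exposes this kubelet field on the Provisioner is exactly what is being discussed, so treat it as illustrative only.

spec:
  kubeletConfiguration:
    podsPerCore: 10   # kubelet caps pods at podsPerCore * cores, further bounded by maxPods if set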


**🔑 Open Questions**

1. There is still a [user ask](https://github.com/aws/karpenter/issues/1803) for calculating the kubeReserved resources value based on the GKE calculation, as opposed to the EKS default, which is based on pod density. Should we allow some flag on the provisioner that determines which calculation to use, e.g. `kubeReservedMemoryStrategy`?
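A sketch of what such a flag might look like; the field name comes from the open question above, and both the name and its placement are hypothetical:

spec:
  kubeletConfiguration:
    kubeReservedMemoryStrategy: pod-density    # EKS-style: reserve memory as a function of max pods
    # kubeReservedMemoryStrategy: memory-tiered  # GKE-style: reserve a tiered percentage of machine memory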
Contributor


I'd love for us to do the work and figure out containerd overhead in 2022, and then provide customers with an accurate runtime class.

@ellistarn
Contributor

🚀 🎸 🎸 🚀

@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch 2 times, most recently from ff2d088 to 6659fe2 Compare September 1, 2022 06:40
Member

@chrisnegus chrisnegus left a comment


This reads really well. I just have a few small suggestions.

@jonathan-innis jonathan-innis changed the title docs: Add instance type settings design proposal docs: RFC: Add instance type settings design proposal Sep 15, 2022
3. If we were to deny overlapping `InstanceTypeSettings`, then the feature of specifying a group of instance types may not be used that often, as configuration between instance types may rarely be shared
4. It’s unclear whether users would have enough common configuration that they would assign across instance types. It makes more sense to iterate on a grouping mechanism later if there is a user ask for it

**Recommendation:** Use the `InstanceType` CRD for the initial iteration with 1-1 mappings before considering introducing more complex constructs to the user. While specifying rulesets might reduce the overhead for customers of keeping track of separate `InstanceTypeSettings`, without an API surface to view exactly what settings a ruleset will produce, it will ultimately be confusing for users.


For our use case (similar to #1900, where nodes offer fuse devices via the smarter device manager), the recommended approach of having a 1-1 mapping would create quite a significant overhead and burden: for every instance type and size, we would need to create an InstanceType resource and configure it. Even with templating, this is quite a bit of effort and seems prone to errors. Also, we would need to keep the list of available instance types up to date, increasing the maintenance work. So going with the second option sounds like a more sustainable approach to me personally.
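To make that overhead concrete, the 1-1 mapping would mean maintaining roughly one CR like the sketch below per instance type; the API group, version, and field layout shown here are assumed for illustration, and the actual schema is whatever the design settles on.

apiVersion: karpenter.k8s.aws/v1alpha1
kind: InstanceType
metadata:
  name: c5.large
spec:
  allocatable:
    smarter-devices/fuse: "1"   # extended resource advertised by smarter-device-manager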

Contributor Author

@jonathan-innis jonathan-innis Sep 21, 2022


recommended approach of having a 1-1 mapping would create quite a significant overhead and burden

How would you get around having this burden? My assumption is that each instance type wouldn't have a static number of resources that map to smarter-devices/fuse, so you would need some dynamic way to configure the quantities for each of the different instance types, or at least for a collection of instance types.

This becomes a problem when extending out to other resources. Even assuming that there is a consistent quantity of resources for the smarter-devices/fuse resource across all instance types, that certainly wouldn't be true for other extended resources.

Also, we would need to keep the list of available instance types up to date

This works as an override mechanism, so you would still have those instance types available if they didn't have a matching CR. The only difference would be that the extended resource wouldn't be mapped onto them, which would still be something that would need to be addressed with the other "collection-based" approach.

Contributor Author


Also, the other thing that we were thinking with the approach proposed in this design is that we would offer some tooling/guidance around how to programmatically generate these values into CRs, so that the maintenance burden is reduced.

@jonathan-innis jonathan-innis force-pushed the instance-type-settings-design branch from 6659fe2 to 6dac178 Compare September 21, 2022 18:10
@jonathan-innis jonathan-innis changed the title docs: RFC: Add instance type settings design proposal RFC: Add instance type settings design proposal Oct 6, 2022
@jonathan-innis
Contributor Author

Summarizing most feedback that we've heard. It seems like we might want to consider a mechanism for "selecting" sets of instance types so that we can ease the customer burden of having to maintain a 1:1 mapping.

An alternative solution here may be to keep the 1:1 mapping and provide some documentation around manifest generation for these InstanceType CRDs.

Though this might be more confusing to reason about, in general it will cause less overhead for users.

@haarchri

haarchri commented Dec 5, 2022

Any chance to see this in one of the coming releases?

@universam1

universam1 commented Dec 26, 2022

@jonathan-innis @ellistarn We would really like to move out of the fork of #2161 for a significant number of clusters. We are forced to maintain it because otherwise this would be a blocking no-go for Karpenter.
In case the final solution is pending longer, would you please reconsider merging #2161 for the time being, as it provides the MVP and has been well tested across hundreds of nodes to be working as expected?
Thank you

@jonathan-innis
Contributor Author

jonathan-innis commented Jan 18, 2023

@universam1 We've talked a bit about this internally. I think we're willing to take a short-term experimental path that would unblock you in the main project. Ideally, the hope is that we can get a more robust solution that should solve for all the problems in the RFC going forward, but right now we haven't put this at super high priority.

Here are the thoughts:

  1. The long-term solution is still under design and is deprioritized for the time being, but one idea that we have is to expose the InstanceTypes offered by a cloudprovider as CRs on the cluster. This would allow you to either SSA changes to the InstanceTypes or to create mutating webhooks that could receive the InstanceTypes and then edit them with anything that you need injected into them (like additional extended resources). Thoughts on this one?
  2. As a short-term solution, we’d like to add an entry to the current karpenter-global-settings configMap (aws.instanceTypeOverrides) that exposes any extended resources (on top of the known ones) through a mapping from instance types to a v1.ResourceList, like the example below. This would be an alpha feature and would be highly likely to change at some point in the future in favor of something more akin to option 1, which would be a more robust long-term solution. Would you be interested in contributing to this short-term solution?
aws.instanceTypeOverrides: '[{"name": "c5.large", "capacity": {"fuse": "1", "other-extended-resource": "2"}}]'
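As a sketch of how that proposed entry could sit in the existing karpenter-global-settings ConfigMap (the key name and value shape are the proposal above, alpha and likely to change, not a released API):

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  aws.instanceTypeOverrides: |
    [{"name": "c5.large", "capacity": {"fuse": "1", "other-extended-resource": "2"}}]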

@ellistarn
Contributor

Perhaps we can model it based off of the instance type struct? { "name": "c5.large", "capacity": {"fuse": "1", "other-extended-resource": "4"} }

@universam1

Thank you @jonathan-innis for considering this request and taking into account several options to solve it. Very much appreciated!

We are certainly happy to contribute to a solution - if you can provide some more details on where you would see it fit in and what you'd specifically like to see, that would help get us started.
No issue with an alpha-grade solution that is likely going to change.

One requirement from our side, though, is to be agnostic to instance type and size. It is impossible for us to predefine all variants we would like to support (type x size).
Simple globbing or regex support in the aws.instanceTypeOverrides matcher doesn't sound hard to add and should meet our needs just fine. Thank you very much for the consideration!

@jonathan-innis
Contributor Author

It is impossible for us to predefine all variants we like to support

Can you explain why this is the case? If you want all types of the size .xlarge, you could just get the instance type names from an aws ec2 describe-instance-types call and then include these in the settings using some programmatic means.

A simple globbing or regex support

I'm not really a fan of opening up this settings schema to regex-based pattern matching. It doesn't seem particularly explicit in its design and implementation, and it seems overly complicated given what we are trying to implement.

@universam1

universam1 commented Jan 23, 2023

include these in the settings using some programmatic means

The instances we permit the devs to request, or allow Karpenter to choose, sum up to roughly 600 types and versions. Say anything >1 core, >3GB, incl. amd64 or arm64. Having to manage that means updating a huge list and/or keeping hundreds of developers informed about which instances are whitelisted and why not.
Offloading this type decision is the selling point of Karpenter, and we love it for that. It feels like this undermines the whole concept.

seems overly-complicated given what we are trying to implement

Could you please explain why this is complicated to implement? I believe a few lines of code would suffice. In contrast, coding an API or generator for the Karpenter definitions is a huge effort to implement.
It would mean that Karpenter now submits to an external API that it would have to comply with, which could easily become an unpleasant dependency for the project!?

It doesn't seem particularly explicit in its design

Alternatively, the type and size could be optional fields. Then the match would be explicitly undefined.

@jonathan-innis
Contributor Author

jonathan-innis commented Jan 23, 2023

I'm less a fan of the regex solution and might be swayed toward a requirements-based solution that allows you to create disparate objects that contain requirements (of which instance-type name can be one) and capacity details, with a merge semantic. Would this suit the case you are describing and be more maintainable for you?

[
   {
      "requirements":[
         {
            "key":"karpenter.k8s.aws/instance-size",
            "operator":"In",
            "values":["large"]
         }
      ],
      "capacity":{
         "fuse":"1",
         "other-extended-resource":"4"
      }
   }
]

In the ConfigMap, it would look like

"aws.instanceTypeOverrides": "[{"requirements":[{"key":"karpenter.k8s.aws/instance-size","operator":"In","values":["large"]}],"capacity":{"fuse":"1","other-extended-resource":"4"}}]"

@universam1

might be swayed toward doing a requirements-based solution

That's a great one! I think the list of instance sizes is not that much of an issue to manage. In fact, I could imagine a hack of inverting to a "NotIn: small" if we really want to permit all sizes but the irrelevant ones.

Thank you!
How do you like us to proceed?

@jonathan-innis
Contributor Author

I think the list of instance sizes is not that much of an issue to manage

On this same train of thought, what if there was a resources field in the Provisioner API that allowed you to specify the amount of extended resources that a Node launched by a Provisioner would have (we would assume this amount for in-flight scheduling until the Node resources actually register)? Do you feel like this type of solution would also work well for you? I'm assuming that if you have a set of requirements like "only use instance types that aren't of the small variety", this would map naturally to Provisioner requirements, where you could add these extended resources.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  resources:
    requests:
      fuse: "2" # We assume this much until the node registers
      some.example.com/extended-resource: "500Mi" # We assume this much until the node registers

@universam1

what if there was a resources field

I actually like this semantic the most: a single authoritative place for Karpenter config! 👏🏻

we would assume this amount for inflight scheduling until the Node resources actually registered

AFAIR Karpenter does not actively assign pods to nodes any more like it used to, leaving that to kube-scheduler? Great, then we'll only have to ensure that the smarter-device-manager daemonset is up in time and has patched the custom resources on the nodes before Karpenter enters a new loop for unscheduled pods... I guess we can tame such flapping via startupTaints to have a defined cool-down delay!? I think this should work fine!
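A minimal sketch of that startupTaints idea, assuming the v1alpha5 Provisioner field; the taint key below is made up and would need to match whatever the smarter-device-manager rollout removes once the device plugin is ready.

spec:
  startupTaints:
    - key: smarter-devices/not-ready   # hypothetical taint, removed by a controller once the device plugin is up
      effect: NoSchedule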

I'm assuming if you have a set of requirements that are something like "only use instance types that aren't of the small variety" that this would map naturally to Provisioner requirements

Indeed, the main part of our specs contains

spec:
  requirements:
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small

Great idea!

@jonathan-innis
Contributor Author

jonathan-innis commented Jan 24, 2023

before Karpenter enters a new loop for unscheduled pods

We handle this with the concept of "node initialization". We wait for the node to go ready and have all requested resources registered before we consider pending pods for scheduling to new nodes, so you shouldn't have to handle this cool-down period yourself. Once we are aware that a resource should appear on the node, we wait until we see it before we consider the node empty or not eligible to schedule pods.

@jonathan-innis jonathan-innis added the needs-design Design required label Feb 27, 2023
@haarchri

haarchri commented Mar 6, 2023

Thanks for this RFC - I wonder, what is the current status?

@jonathan-innis
Contributor Author

@haarchri We're keeping this open to keep the conversation going, since this is the largest item currently tracking and creating pointers to all the open issues. There's been a lot of discussion among the maintainers and contributors around how we should attack a lot of these disparate problems, so the RFC is still unsettled and probably won't be until the project settles and merges a lot of the other current feature work:

  1. feat: Machine Migration kubernetes-sigs/karpenter#176
  2. feat: added machine disruption gates kubernetes-sigs/karpenter#221

@jonathan-innis
Contributor Author

As mentioned here, there is still a lot of uncertainty on the plan of attack with how to handle custom instance type settings. The issues mentioned in this initial RFC will be addressed at a later time, with a rewritten RFC. Closing this for now.

@universam1

universam1 commented Apr 19, 2023

@ellistarn @jonathan-innis Trying to get a feeling for the timeline to expect for your long-term solution. We are wondering if we should maintain a fork with an intermediate solution for the time being, or if there are other options possible.
As an idea, could we guard the #2161 implementation behind an alpha flag to unblock those users willing to take the risk for the time being? Would that be a feasible approach?
