Allow registration TTL to be user configurable #357
Comments
Were you able to find a way to reduce the bootstrapping time on your end? If you keep a higher registration TTL you may see additional latency when provisioning other nodes (you will notice this when you need many of them), given the reconciliation-time and performance trade-offs involved. Generally there is a good chance that improving the bootstrapping process is a better path than raising the timeouts here, since otherwise you are likely to end up compounding your latencies.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
To some degree we were able to address our bootstrapping times, however we continue to run into issues with GPU-based nodes. Without significantly fragmenting our nodepools (essentially we would have to bake an AMI per driver variant and version), we can't actually drive down the bootstrap time. Instances like x1.32xlarge take 20 minutes for us to start up due to downloading and installing drivers. I suspect this is common for those who use GPUs. Chatted with @Bryce-Soghigian, who also has some data points.
/remove-lifecycle stale
AKS has some failures where the E2E tests we have for GPU bootstrapping will fail due to taking too long to initialize some of the GPU instances. (I would have to go digging; it's not common, since we use small GPU SKUs for the e2es.) One could argue: just use prebaked GPU images, since bootstrapping taking longer than 15 minutes is ridiculous and we should insist on the highest standard. But this is a pattern AKS is moving away from (see the dedicated GPU VHD), and the VHDs for Ubuntu 22.04 will not be supporting it. Even if they did, it doesn't fully solve the problem: if I want to use GPU operator to leverage drivers AKS doesn't ship, or some other pattern to install the GPU drivers myself and manage that lifecycle, then they can't be baked in. (Note: azure/karpenter-provider-azure does not support GPU operator currently, but it is an interesting direction.) Cluster Autoscaler, for comparison, exposes a configurable node-provisioning timeout.
Other alternatives to consider: making this provider-configurable (a more conservative step forward, and likely low-hanging fruit), or internally configurable by instance type (assuming all known use cases are directly tied to particular instance types).
Also, here is the updated code link: karpenter/pkg/controllers/nodeclaim/lifecycle/liveness.go Lines 38 to 40 in 0fea7ce
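For readers who can't follow the link: those lines define the TTL as a package-level constant. A paraphrased sketch (not an exact quote of commit 0fea7ce) looks roughly like this:

```go
// Paraphrased sketch of the linked lines; the exact code at the pinned commit may differ.
package lifecycle

import "time"

// registrationTTL is the window within which a launched node is expected to
// register with the cluster; past it, the NodeClaim is deleted and retried.
const registrationTTL = time.Minute * 15
```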
I'm open to making this a var and provider-configurable. I'm not 100% sold yet that it should be user-configurable, since ideally we should just enforce best practices rather than surfacing a bunch of options that allow users to get away with the wrong thing.
I guess rather than a raw number, we could provide users with two different TTLs: one for normal instances (15 minutes) and a second for slow-startup instances (30 minutes?). That way we provide some flexibility without letting users set an arbitrarily long duration. We could expose this via EC2NodeClass or the instance type override config map #751 (comment)
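As a rough illustration of that idea (every identifier below is hypothetical; none of this exists in Karpenter today), the two TTLs could live as provider-overridable durations with a small selector:

```go
// Hypothetical sketch of the dual-TTL proposal; none of these identifiers
// exist in Karpenter today.
package lifecycle

import "time"

var (
	// defaultRegistrationTTL preserves today's behavior for ordinary instances.
	defaultRegistrationTTL = 15 * time.Minute
	// slowStartRegistrationTTL covers instances known to bootstrap slowly,
	// e.g. GPU nodes that download and install drivers on first boot.
	slowStartRegistrationTTL = 30 * time.Minute
)

// registrationTTLFor picks a TTL from a provider-supplied hint, which could be
// sourced from an EC2NodeClass field or an instance type override config map.
func registrationTTLFor(slowStart bool) time.Duration {
	if slowStart {
		return slowStartRegistrationTTL
	}
	return defaultRegistrationTTL
}
```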
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Is there any plan for this issue?
Any plans to implement this?
/triage accepted
If anyone would like to contribute to this issue, please open an RFC.
/help
@engedaam: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Tell us about your request
At this time the registration TTL is hard-coded to 15 minutes. It would be greatly appreciated if this could be made configurable by the user.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
At this time the registration TTL is hard-coded (https://github.com/aws/karpenter-core/blob/main/pkg/controllers/machine/lifecycle/liveness.go#L38), however some of our nodes take close to, and sometimes over, the 15-minute window to run -> bootstrap -> join the cluster. This means these nodes are terminated after 15 minutes even though they would eventually have been able to join the cluster.
Are you currently working around this issue?
Since this isn't user-configurable, we have to fork the code and build our own Karpenter image to bypass this limitation.
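For anyone curious what that fork amounts to, it is essentially a one-line change to the hard-coded constant before rebuilding the image; a sketch (the exact file location varies by version):

```go
// Illustrative fork-only change, not an upstream option: bump the hard-coded
// registration TTL in the liveness controller and rebuild the Karpenter image.
package lifecycle

import "time"

const registrationTTL = time.Minute * 30 // upstream hard-codes time.Minute * 15
```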
Additional Context
Conversation from Slack: https://kubernetes.slack.com/archives/C02SFFZSA2K/p1685123613509889
Attachments
No response
Community Note