Allow registration TTL to be user configurable #357

garvinp-stripe · 2023-05-31T17:57:59Z

Tell us about your request

At this time, the registration TTL is hard-coded to 15 minutes. It would be greatly appreciated if this could be made user-configurable.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

At this time, the registration TTL is hard-coded (https://github.com/aws/karpenter-core/blob/main/pkg/controllers/machine/lifecycle/liveness.go#L38). However, some of our nodes take close to, and sometimes over, the 15-minute window to run -> bootstrap -> join the cluster. This means these nodes are terminated after 15 minutes even though they would have been able to join the cluster.

Are you currently working around this issue?

Since this isn't user-configurable, we have to fork the code and build our own Karpenter image to bypass this limitation.

Additional Context

Conversation from Slack: https://kubernetes.slack.com/archives/C02SFFZSA2K/p1685123613509889

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

sadath-12 · 2023-10-14T06:55:00Z

Were you able to find a way to reduce the time for bootstrapping the cluster on your end? If you set a higher registration TTL, you might encounter additional latency in provisioning other nodes (you will notice this when you need many of them), given how we handle reconciliation time and performance trade-offs. In general, there is a good chance that improving the bootstrapping process is a better option than configuring the timeouts here, since a longer timeout makes it more likely that you end up compounding your latencies.

k8s-triage-robot · 2024-01-30T08:19:52Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

garvinp-stripe · 2024-02-21T19:07:04Z

To some degree we were able to address our bootstrapping times; however, we continue to run into issues with GPU-based nodes. We can't actually drive down the bootstrap time without significantly fragmenting our nodepools, because we would essentially have to bake an AMI per driver variant and version.

Instances like x1.32xlarge take 20 minutes for us to start up due to downloading and installing drivers. I suspect this is common for those who use GPUs. I chatted with @Bryce-Soghigian, who also has some data points.

Bryce-Soghigian · 2024-02-21T19:08:45Z

/remove-lifecycle stale

Bryce-Soghigian · 2024-02-21T19:24:35Z

AKS has some failures where our E2E tests for GPU bootstrapping fail because some of the GPU instances take too long to initialize. (I would have to go digging; it's not common, since we use small GPU SKUs for the E2Es.)

One could argue: use prebaked GPU images. Bootstrapping taking longer than 15 minutes is ridiculous and we should insist on the highest standard. This is a pattern AKS is moving away from (see the dedicated GPU VHD), and the VHDs for Ubuntu2204 will not be supporting this pattern. Even if they did, it doesn't fully solve the problem. If I want to use GPU operator to leverage some drivers AKS doesn't have, or some other pattern to install the GPU drivers myself and manage that lifecycle, then it can't be baked in. (Note: azure/karpenter-provider-azure does not currently support GPU operator, but it is an interesting direction.)

Cluster Autoscaler exposes MaxProvisioningTime, which serves a similar purpose (CAS GCs nodes/VMs that don't register with the API by MaxProvisioningTime).

tallaxes · 2024-02-21T23:15:14Z

Allow registration TTL to be user configurable

Other alternatives to consider may be provider-configurable (a more conservative step forward, likely a low-hanging fruit), or internally configurable by instance type (assuming all known use cases are directly related to using particular instance types).

tallaxes · 2024-02-21T23:15:34Z

Also, here is the updated code link:

karpenter/pkg/controllers/nodeclaim/lifecycle/liveness.go

Lines 38 to 40 in 0fea7ce

    // registrationTTL is a heuristic time that we expect the node to register within
    // If we don't see the node within this time, then we should delete the NodeClaim and try again
    const registrationTTL = time.Minute * 15

jonathan-innis · 2024-02-22T07:13:29Z

I'm open to making this a var and provider-configurable. I'm not 100% sold yet that this should be user-configurable, since ideally we should just enforce best practices rather than surfacing a bunch of options that allow users to get away with the wrong thing.

garvinp-stripe · 2024-03-14T21:43:27Z

I guess, rather than a raw number, we could provide users with two different TTLs: one for normal instances (15 minutes) and a second for slow start-up (30 minutes?). That way we provide some flexibility without allowing users to set a really long duration. We could expose this via EC2NodeClass or the instance type override config map (#751 (comment)).

k8s-triage-robot · 2024-06-12T22:12:52Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-07-12T22:23:26Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-08-11T22:49:50Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-08-11T22:49:54Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

leoryu · 2024-09-20T07:43:43Z

Is there any plan for this issue?

dmity-st · 2024-10-17T16:10:51Z

Any plans to implement this?

engedaam · 2024-12-12T17:44:51Z

/triage accepted
/remove-lifecycle rotten
/help-wanted

engedaam · 2024-12-12T17:45:28Z

If anyone would like to contribute to this issue, please open an RFC.

engedaam · 2024-12-12T17:46:16Z

/help

k8s-ci-robot · 2024-12-12T17:46:19Z

@engedaam:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

garvinp-stripe added the kind/feature Categorizes issue or PR as related to a new feature. label May 31, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 12, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 11, 2024

njtran reopened this Sep 30, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 30, 2024

njtran added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 30, 2024

engedaam mentioned this issue Oct 16, 2024

Node claim lifetime configuration aws/karpenter-provider-aws#7212

Closed

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 12, 2024

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Dec 12, 2024
