-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA - AWS CloudProvider - Fallback to Static EC2 list rather than fatal error #4480
CA - AWS CloudProvider - Fallback to Static EC2 list rather than fatal error #4480
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gjtempleton The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes sense since random issues out of the user's control could cause the failure to dynamically load the information, but, correct me if I'm wrong, the instance type information only gets loaded when Cluster Autoscaler starts up. Since the instance type information never gets updated, defaulting to the static information could definitely mask issues on the user's side. And since it only gets called when Cluster Autoscaler starts-up, the warning could easily be missed by a user.
I think this change would work great if there was also a periodic regeneration of the instance types information. Then, if it's a user issue, this warning would be printed multiple times in the logs. Or, if it's an issue out of their control, the map will be fixed soon automatically.
If I read the code wrong and there is in fact already a periodic regeneration of the instance type information, then this change looks good.
@gjtempleton @MyannaHarris yeah, I think we should revisit the whole dynamic instance type fetching. It's cause OOM issues (#4220, #4036, #3044, #3506) and I believe a better, more stable, approach to solving this problem would be to add a periodic CI job, initially set to run, say, every night or something (but potentially being triggered off some AWS-sourced event) that calls the |
Thanks for the feedback and thoughts. No arguments here with revisiting the dynamic instance type fetching wholesale. There are already some incremental improvements in flight/implemented, though they don't materially change the model of a hardcoded fallback list or a default dynamically generated one on startup. There has been an improvement in the memory use of the current approach since the merging of #4199 to move to stream processing of the API. We also now have a PR (which I haven't had time to have a look at yet) to move from the current JSON implementation to using As much as I love the proposal to automate the regeneration of the static list, we'd need to move to an automated process of also cutting and promoting the image releases to make it usable enough for users to move away from the dynamic list generation on startup being the default behaviour, there's currently far too much manual action/friction involved in that pipeline to make the process sufficiently fast in my view. |
Closing as superseded by the larger improvements brought in by #4468 |
Related to #4464
Currently, if the CA is unable to dynamically load instances the CA immediately crashes fatally. Instead, we should gracefully fall back to degraded functionality, using the bundled static list of EC2 instance types, warning the user this is what we're doing.
I'm not 100% sure this is the right move to make, as it will potentially mask more errors in people's configs by running with degraded functionality, though #4468 should help alleviate this, by returning more meaningful AWS errors from
GenerateEC2InstanceTypes
.In this case, if someone was running this way (e.g. due to insufficient IAM permissions) and tried to use a newer instance type not included in the fallback static list, they would then receive an error. It may not be clear from the distance between where this fallback message is, and the error being generated why this would be happening: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L328