Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA - AWS CloudProvider - Fallback to Static EC2 list rather than fatal error #4480

Closed

Conversation

gjtempleton
Copy link
Member

@gjtempleton gjtempleton commented Nov 28, 2021

Related to #4464

Currently, if the CA is unable to dynamically load instances the CA immediately crashes fatally. Instead, we should gracefully fall back to degraded functionality, using the bundled static list of EC2 instance types, warning the user this is what we're doing.

I'm not 100% sure this is the right move to make, as it will potentially mask more errors in people's configs by running with degraded functionality, though #4468 should help alleviate this, by returning more meaningful AWS errors from GenerateEC2InstanceTypes.

In this case, if someone was running this way (e.g. due to insufficient IAM permissions) and tried to use a newer instance type not included in the fallback static list, they would then receive an error. It may not be clear from the distance between where this fallback message is, and the error being generated why this would be happening: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L328

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 28, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2021
@gjtempleton
Copy link
Member Author

@Jeffwan I would appreciate your input on this.

/assign @jaypipes

Copy link
Contributor

@MyannaHarris MyannaHarris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense since random issues out of the user's control could cause the failure to dynamically load the information, but, correct me if I'm wrong, the instance type information only gets loaded when Cluster Autoscaler starts up. Since the instance type information never gets updated, defaulting to the static information could definitely mask issues on the user's side. And since it only gets called when Cluster Autoscaler starts-up, the warning could easily be missed by a user.

I think this change would work great if there was also a periodic regeneration of the instance types information. Then, if it's a user issue, this warning would be printed multiple times in the logs. Or, if it's an issue out of their control, the map will be fixed soon automatically.

If I read the code wrong and there is in fact already a periodic regeneration of the instance type information, then this change looks good.

@jaypipes
Copy link
Contributor

jaypipes commented Dec 2, 2021

@gjtempleton @MyannaHarris yeah, I think we should revisit the whole dynamic instance type fetching. It's cause OOM issues (#4220, #4036, #3044, #3506) and I believe a better, more stable, approach to solving this problem would be to add a periodic CI job, initially set to run, say, every night or something (but potentially being triggered off some AWS-sourced event) that calls the DescribeInstanceTypes API call and regenerates the static instance type structs, automatically creating a pull request that updates the master branch. Potentially we could write some automation that does the same for release branches as well... thoughts?

@gjtempleton
Copy link
Member Author

Thanks for the feedback and thoughts.

No arguments here with revisiting the dynamic instance type fetching wholesale. There are already some incremental improvements in flight/implemented, though they don't materially change the model of a hardcoded fallback list or a default dynamically generated one on startup.

There has been an improvement in the memory use of the current approach since the merging of #4199 to move to stream processing of the API. We also now have a PR (which I haven't had time to have a look at yet) to move from the current JSON implementation to using DescribeInstanceTypes in #4468.

As much as I love the proposal to automate the regeneration of the static list, we'd need to move to an automated process of also cutting and promoting the image releases to make it usable enough for users to move away from the dynamic list generation on startup being the default behaviour, there's currently far too much manual action/friction involved in that pipeline to make the process sufficiently fast in my view.

@gjtempleton
Copy link
Member Author

Closing as superseded by the larger improvements brought in by #4468

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants