CA - AWS CloudProvider - Fallback to Static EC2 list rather than fatal error #4480

gjtempleton · 2021-11-28T23:30:13Z

Related to #4464

Currently, if the CA is unable to dynamically load the list of EC2 instance types, it immediately exits with a fatal error. Instead, we should gracefully fall back to degraded functionality, using the bundled static list of EC2 instance types and warning the user that this is what we're doing.

I'm not 100% sure this is the right move, as running with degraded functionality could mask errors in people's configs, though #4468 should help alleviate this by returning more meaningful AWS errors from GenerateEC2InstanceTypes.

In this case, if someone was running this way (e.g. due to insufficient IAM permissions) and tried to use a newer instance type not included in the static fallback list, they would then receive an error. Because the fallback warning is logged far from where that error is generated, it may not be clear why it is happening: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L328
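
For illustration, the proposed behaviour could be sketched roughly as below. This is a minimal sketch, not the code in this PR: `InstanceType`, `staticInstanceTypes`, `loadInstanceTypes`, and `failingFetch` are hypothetical, simplified stand-ins, and the dynamic loader is passed in as a plain function so the example does not depend on the AWS SDK.

```go
package main

import (
	"errors"
	"log"
)

// InstanceType is a hypothetical, simplified record for one EC2 instance type.
type InstanceType struct {
	Name     string
	VCPU     int64
	MemoryMb int64
}

// staticInstanceTypes is a hypothetical stand-in for the bundled static list.
var staticInstanceTypes = map[string]*InstanceType{
	"m5.large": {Name: "m5.large", VCPU: 2, MemoryMb: 8192},
}

// loadInstanceTypes calls the dynamic loader first. If it fails, it logs a
// warning and returns the static list instead of terminating the process.
// The boolean result reports whether the fallback was used.
func loadInstanceTypes(fetch func() (map[string]*InstanceType, error)) (map[string]*InstanceType, bool) {
	types, err := fetch()
	if err != nil {
		log.Printf("WARNING: could not load EC2 instance types dynamically (%v); "+
			"falling back to the static list, newer instance types may be missing", err)
		return staticInstanceTypes, true
	}
	return types, false
}

// failingFetch simulates a dynamic load that fails, for example because of
// insufficient IAM permissions.
func failingFetch() (map[string]*InstanceType, error) {
	return nil, errors.New("access denied")
}
```

Calling `loadInstanceTypes(failingFetch)` returns the static map and `true`. Returning the flag lets a caller repeat the warning later, for example when an unknown instance type is looked up, which addresses the concern about the warning and the eventual error being far apart.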

k8s-ci-robot · 2021-11-28T23:30:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/aws/OWNERS~~ [gjtempleton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gjtempleton · 2021-11-28T23:36:24Z

@Jeffwan I would appreciate your input on this.

/assign @jaypipes

MyannaHarris

This change makes sense, since random issues outside the user's control could cause the dynamic load of the instance type information to fail. But, correct me if I'm wrong, the instance type information only gets loaded when Cluster Autoscaler starts up. Since it is never updated afterwards, defaulting to the static information could definitely mask issues on the user's side. And since the load only happens at startup, the warning could easily be missed by a user.

I think this change would work great if there were also a periodic regeneration of the instance type information. Then, if it's a user issue, this warning would be printed multiple times in the logs. Or, if it's an issue out of their control, the map would soon be fixed automatically.

If I misread the code and there is in fact already a periodic regeneration of the instance type information, then this change looks good.

jaypipes · 2021-12-02T14:35:36Z

@gjtempleton @MyannaHarris yeah, I think we should revisit the whole dynamic instance type fetching. It has caused OOM issues (#4220, #4036, #3044, #3506), and I believe a better, more stable approach would be to add a periodic CI job, initially set to run, say, every night (but potentially triggered off some AWS-sourced event), that calls the DescribeInstanceTypes API and regenerates the static instance type structs, automatically creating a pull request that updates the master branch. Potentially we could write some automation that does the same for release branches as well... thoughts?

gjtempleton · 2021-12-02T16:54:37Z

Thanks for the feedback and thoughts.

No arguments here with revisiting the dynamic instance type fetching wholesale. There are already some incremental improvements in flight or implemented, though they don't materially change the model of a hardcoded fallback list plus a list that is generated dynamically on startup by default.

The memory use of the current approach has improved since #4199 was merged, which moved to stream processing of the API. We also now have a PR, #4468 (which I haven't had time to look at yet), to move from the current JSON implementation to using DescribeInstanceTypes.

As much as I love the proposal to automate the regeneration of the static list, we'd also need an automated process for cutting and promoting the image releases before it would be usable enough for users to move away from dynamic list generation on startup as the default behaviour. In my view, there's currently far too much manual action and friction in that pipeline to make the process sufficiently fast.

gjtempleton · 2022-01-10T15:46:21Z

Closing as superseded by the larger improvements brought in by #4468

CA - AWS CloudProvider - Fallback to Static EC2 list rather than fata…

6c3c983

…l error

k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/XS (denotes a PR that changes 0-9 lines, ignoring generated files) labels Nov 28, 2021

k8s-ci-robot requested review from aleksandra-malinowska and feiskyer November 28, 2021 23:30

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Nov 28, 2021

k8s-ci-robot assigned jaypipes Nov 28, 2021

gjtempleton mentioned this pull request Nov 28, 2021

CA failed to load Instance Type list unless configured with hostNetworking #4464

Closed

MyannaHarris reviewed Dec 1, 2021

jbartosik added the area/cluster-autoscaler label Dec 2, 2021

gjtempleton closed this Jan 10, 2022
