Turn automatic-recovery off by default #94

Open
himanshu-kun opened this issue Aug 12, 2022 · 5 comments
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related kind/discussion Discussion (enaging others in deciding about multiple options) kind/enhancement Enhancement, improvement, extension kind/test Test priority/5 Priority (lower number equals higher priority)

Comments

@himanshu-kun
Contributor

himanshu-kun commented Aug 12, 2022

What would you like to be added:
MCM should turn off automatic recovery (https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-ec2-default-automatic-recovery/) for an instance by default. Automatic recovery is an AWS feature that recovers the instance on a new host in case of a host failure, keeping the same instance ID and attached volumes.

Why is this needed:
Currently, MCM has its own health-check mechanism: it terminates a machine that has been unhealthy (kubelet not responding, or some other condition) for longer than healthTimeout (10 minutes by default). This means there are two recovery actions that could race against each other.
If AWS auto-recovery completes before the health check times out, that is fine.
But if it takes longer (meaning the instance is still being moved from one host to another and volumes are detaching), MCM recovery kicks in and deletes the instance on the new host to start a new instance altogether. This detaches the volumes again and leads to a longer recovery, which is undesirable.

@himanshu-kun himanshu-kun added the kind/enhancement Enhancement, improvement, extension label Aug 12, 2022
@himanshu-kun himanshu-kun changed the title Turn automatic-recovery off Turn automatic-recovery off by default Aug 12, 2022
@himanshu-kun himanshu-kun added kind/discussion Discussion (enaging others in deciding about multiple options) kind/test Test area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related priority/2 Priority (lower number equals higher priority) labels Aug 12, 2022
@himanshu-kun
Contributor Author

The initial discussion can be found here: https://sap-ti.slack.com/archives/CBVQLMS6N/p1659345327978849

It still needs to be tested whether this kind of scenario can actually happen.

@himanshu-kun
Contributor Author

Another question is whether to keep the parameter configurable by the user. MCM's health recovery is currently sufficient, but there could be scenarios where a customer wants to rely on the AWS recovery method instead.
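
One possible shape for such a parameter, as a hedged Go sketch: the `MachineOptions` type, its `AutoRecovery` field, and `resolveAutoRecovery` are hypothetical and not part of the current provider spec. The two string values match what the EC2 instance maintenance option accepts (`disabled` and `default`). Leaving the field unset resolves to `disabled`, which is the default proposed in this issue.

```go
package main

import "fmt"

// AutoRecovery mirrors the two values EC2 accepts for the
// auto-recovery instance maintenance option.
type AutoRecovery string

const (
	AutoRecoveryDisabled AutoRecovery = "disabled"
	AutoRecoveryDefault  AutoRecovery = "default" // AWS behaviour: recovery enabled
)

// MachineOptions is a hypothetical, trimmed-down stand-in for a provider spec.
type MachineOptions struct {
	// AutoRecovery is optional; nil means the user did not set it.
	AutoRecovery *AutoRecovery
}

// resolveAutoRecovery applies the proposed default (off) and rejects
// values that EC2 would not understand.
func resolveAutoRecovery(o MachineOptions) (AutoRecovery, error) {
	if o.AutoRecovery == nil {
		return AutoRecoveryDisabled, nil
	}
	switch *o.AutoRecovery {
	case AutoRecoveryDisabled, AutoRecoveryDefault:
		return *o.AutoRecovery, nil
	default:
		return "", fmt.Errorf("unsupported autoRecovery value %q", *o.AutoRecovery)
	}
}

func main() {
	v, _ := resolveAutoRecovery(MachineOptions{})
	fmt.Println("unset:", v)

	optIn := AutoRecoveryDefault
	v, _ = resolveAutoRecovery(MachineOptions{AutoRecovery: &optIn})
	fmt.Println("opt-in:", v)
}
```

A pointer field keeps "not set" distinguishable from an explicit choice, so users who want to depend on AWS recovery can opt back in.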

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Feb 8, 2023
@himanshu-kun
Contributor Author

@D063648 do you still find this feature relevant?

@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers priority/4 Priority (lower number equals higher priority) priority/5 Priority (lower number equals higher priority) and removed priority/2 Priority (lower number equals higher priority) priority/4 Priority (lower number equals higher priority) labels Feb 27, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Nov 6, 2023
@rishabh-11 rishabh-11 removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers labels Jul 3, 2024
@rishabh-11
Contributor

Grooming Decision:-

Check whether auto-recovery is in progress (research is needed to see if this is possible). If it is, revisit the health timeout to give the instance time to recover.

@ashwani2k

This may not be required if we introduce gardener/machine-controller-manager#755.
@rishabh-11 to take a call on whether we need to do this here, or whether we can close this in favor of "React faster if VM instance is gone (i.e. don’t wait until full machineHealthTimeout/machineCreationTimeout lapses)" #755.
