
React faster if VM instance is gone (i.e. don’t wait until full machineHealthTimeout/machineCreationTimeout lapses) #755

Open
vlerenc opened this issue Oct 24, 2022 · 6 comments
Labels
area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/backlog Issues which needs implementation priority/2 Priority (lower number equals higher priority)

Comments

@vlerenc
Member

vlerenc commented Oct 24, 2022

What happened:

While investigating pending workload during a rolling update that looked as if drain had started on nodes without sufficient new capacity (see also issue #754), we found that VM instances which will never come up are not recognized as such: the full machineHealthTimeout/machineCreationTimeout must lapse first, which slows down auto-scaling/self-healing/rolling updates and may lead to prolonged pod pending times.

What was expected:

When the VM instance backing a machine is in a fatal error state or even terminated, we should probably act faster. It was observed that hyperscaler API endpoints may temporarily serve stale, broken, or no data at all (these systems are highly distributed and only eventually consistent), but we should still act if, e.g.:

  • the API returns the VM instance with a status of fatal error or terminated, AND...
  • the kubelet stopped reporting at around or before the VM instance's last timestamp, AND...
  • a much smaller timeout of e.g. 1m (instead of the 10m/20m defaults for machineHealthTimeout/machineCreationTimeout) has lapsed.
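The three conditions above could be combined into a single predicate, roughly as in the following sketch. This is purely illustrative: the function name, statuses, and the 1m threshold are stand-ins for whatever the MCM implementation would actually use.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the proposed fast-fail check; names and
# thresholds are illustrative, not the actual MCM implementation.
FAST_FAIL_TIMEOUT = timedelta(minutes=1)  # much shorter than machineHealthTimeout

def should_fail_fast(vm_status: str,
                     kubelet_last_report: datetime,
                     vm_last_seen: datetime,
                     condition_since: datetime,
                     now: datetime) -> bool:
    """Return True when the machine may be failed early.

    All three proposed conditions must hold:
      1. the provider API reports the VM as terminated or in a fatal error,
      2. the kubelet stopped reporting at or before the VM's last timestamp,
      3. a short grace timeout has lapsed since the condition was first seen.
    """
    vm_gone = vm_status in ("terminated", "fatal-error")
    kubelet_silent = kubelet_last_report <= vm_last_seen
    timeout_lapsed = now - condition_since >= FAST_FAIL_TIMEOUT
    return vm_gone and kubelet_silent and timeout_lapsed
```

Requiring all three signals keeps a transient API glitch alone from failing a healthy machine.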

This should help to recover much faster in the future. WDYT?

How to reproduce it:

Terminate a running VM instance backing a machine resource, or create a new machine resource (e.g. via scale-out) but block the new VM instance from coming up (e.g. terminate it immediately). You will see that machineHealthTimeout/machineCreationTimeout have to lapse before anything happens, even though the machines are in a terminal state from which they will never recover and the kubelets no longer report in.

while true; do
  sleep 0.1
  aws ec2 describe-instances \
    --filters Name=tag-key,Values=kubernetes.io/cluster/shoot--project--cluster \
              Name=instance-state-code,Values=0 \
              Name=availability-zone,Values=eu-central-1c \
    --no-paginate --output json \
  | jq -r '.Reservations[].Instances[].InstanceId' \
  | tr '\n' ' ' \
  | xargs -I {} sh -c 'echo Terminating instances: {}; aws ec2 terminate-instances --instance-ids {} | cat'
done
@vlerenc vlerenc added kind/enhancement Enhancement, improvement, extension priority/backlog Issues which needs implementation priority/3 Priority (lower number equals higher priority) labels Oct 24, 2022
@sskrustev

Hello Vedran Lerenc,

This is a kind reminder. Are there any updates here?

Kind regards,
Stanimir

@elankath
Contributor

elankath commented Feb 24, 2023

Grooming Results

  • Once the machine phase has moved to Unknown, we should check the error code of the Driver.GetMachineStatus call to see whether the VM instance is gone.
  • If so, we should not wait for the health-check timeout before moving the machine phase to Failed.
  • Instead of performing the GetMachineStatus call only for an Unknown machine, also consider checking the VM instance status regularly (say, every few iterations of reconcileHealthCheck) and moving the machine phase to Failed if the VM is not there. This would obviate the need for the long Running -> Unknown -> Failed progression across consecutive invocations of the health-check reconciliation.
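The grooming proposal above could be sketched roughly as follows. The driver interface, phase strings, and probe cadence are illustrative stand-ins for Driver.GetMachineStatus and the real machine phases, not MCM's actual code.

```python
# Illustrative sketch of the grooming proposal, not MCM's actual code.
# get_machine_status and the phase strings are assumed stand-ins for
# Driver.GetMachineStatus and the machine phases.

PROBE_EVERY_N_RECONCILES = 3  # probe the provider every few health checks

class NotFound(Exception):
    """Stand-in for the driver's 'VM not found' error code."""

def reconcile_health_check(machine: dict, driver, iteration: int) -> str:
    """One health-check reconciliation; returns the resulting machine phase."""
    # Probe the provider regularly instead of only after phase Unknown,
    # so a gone VM skips the long Running -> Unknown -> Failed path.
    if iteration % PROBE_EVERY_N_RECONCILES == 0:
        try:
            driver.get_machine_status(machine)
        except NotFound:
            return "Failed"  # do not wait for machineHealthTimeout
    return machine["phase"]
```

Probing only every few iterations keeps the extra provider API calls bounded while still failing a gone VM within a couple of reconcile cycles.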

@himanshu-kun
Contributor

himanshu-kun commented Feb 24, 2023

One possible solution is to rely on the CCM, as it deletes the node object if the VM is gone. Then we would not have to check the VM on the infrastructure, at least in the machineHealthTimeout case. But the node can also be explicitly deleted by an operator/user, so we would need to verify that the node object does not come back; if it does, checking the VM status would be the safest thing to do.

@himanshu-kun himanshu-kun added priority/2 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers area/robustness Robustness, reliability, resilience related and removed priority/3 Priority (lower number equals higher priority) labels Feb 24, 2023
@vlerenc
Member Author

vlerenc commented Apr 5, 2023

we should check the the error code of Driver.GetMachineStatus call to see if the VM instance has gone.

The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much longer than the 10 minutes we try to shave off here). They show the machine, don't show it, show it again; it takes a while until all API endpoints return the same result. Then again, if the machine is older than 2h (the time window in which we see this issue, e.g. with Azure) and one API endpoint reports it gone, it is probably gone (unless that API endpoint is broken in some other way). We could safeguard this somewhat more and wait to see that result 3x (ignoring other results where the machine is reported present). Still, risky. We simply cannot trust the cloud provider API to return the correct result due to eventual consistency.

@elankath
Contributor

elankath commented Apr 5, 2023

The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much more than the 10 minutes we try to shave off here). They show the machine, don't show it, show it - it takes a while until all API endpoints return the same result.

Agreed. We don't need to take the decision to move the machine to the Failed phase immediately. Our health check runs in every reconcile cycle, so if we continuously confirm the absence of the VM for 3 invocations or ~1 min, the confidence should be sufficient to move the machine to the Failed phase. We should also log (and emit an event) when we do so.
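The consecutive-confirmation idea could be captured by a small counter like the following sketch. The class and threshold are hypothetical, and it follows the stricter variant discussed here (any sighting of the VM resets the count), not necessarily the final design.

```python
# Hedged sketch of the consecutive-confirmation debounce discussed above;
# the class, method, and threshold are illustrative, not MCM's actual API.

CONFIRMATIONS_NEEDED = 3  # ~3 reconcile cycles, about a minute

class AbsenceTracker:
    """Counts consecutive 'VM not found' probe results to absorb the
    eventual-consistency flicker of cloud provider APIs."""

    def __init__(self) -> None:
        self.misses = 0

    def observe(self, vm_found: bool) -> bool:
        """Record one probe result; return True once the absence has been
        confirmed often enough in a row to fail the machine."""
        if vm_found:
            self.misses = 0  # any sighting resets the counter
            return False
        self.misses += 1
        return self.misses >= CONFIRMATIONS_NEEDED
```

Resetting on every sighting means a flickering API endpoint never accumulates enough misses, at the cost of reacting slower when the API alternates between stale and fresh data.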

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 13, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 21, 2024
@ashwani2k
Contributor

ashwani2k commented Oct 16, 2024

While implementing this check, we should not take early action if automatic recovery (https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-ec2-default-automatic-recovery/) is already underway for a failed instance, as it is enabled by default in our usage.
However, if the VM is gone with an orphaned node and machine, then we should not wait for the health timeout and should replace the machine.
