
React faster if VM instance is gone (i.e. don’t wait until full machineHealthTimeout/machineCreationTimeout lapses) #755

Open
vlerenc opened this issue Oct 24, 2022 · 6 comments
Labels
area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/backlog Issues which needs implementation priority/2 Priority (lower number equals higher priority)

Comments

@vlerenc
Member

vlerenc commented Oct 24, 2022

What happened:

While investigating pending workload during a rolling update that looked as if drain had started on nodes without sufficient new capacity (see also issue #754), we found that VM instances which will never come up are not recognized as such: the full machineHealthTimeout/machineCreationTimeout must lapse first, which slows down auto-scaling/self-healing/rolling updates and may lead to prolonged pod pending times.

What was expected:

When the VM instance backing a machine is in a fatal error state or even terminated, we should probably act faster. It was observed that hyperscaler API endpoints may temporarily serve stale, broken, or no data at all (these systems are highly distributed and only eventually consistent), but we should still act if, e.g.:

  • the API returns the VM instance with a status of fatal error or terminated, AND...
  • the kubelet stopped reporting at around or before the VM instance's last timestamp, AND...
  • a much smaller timeout of e.g. 1m (instead of the 10m/20m defaults for machineHealthTimeout/machineCreationTimeout) has lapsed.
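The three conditions above could be combined into a single predicate, roughly as in the following sketch. This is purely illustrative: the function name, statuses, and the 1m threshold are stand-ins for whatever the MCM implementation would actually use.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the proposed fast-fail check; names and
# thresholds are illustrative, not the actual MCM implementation.
FAST_FAIL_TIMEOUT = timedelta(minutes=1)  # much shorter than machineHealthTimeout

def should_fail_fast(vm_status: str,
                     kubelet_last_report: datetime,
                     vm_last_seen: datetime,
                     condition_since: datetime,
                     now: datetime) -> bool:
    """Return True when the machine may be failed early.

    All three proposed conditions must hold:
      1. the provider API reports the VM as terminated or in a fatal error,
      2. the kubelet stopped reporting at or before the VM's last timestamp,
      3. a short grace timeout has lapsed since the condition was first seen.
    """
    vm_gone = vm_status in ("terminated", "fatal-error")
    kubelet_silent = kubelet_last_report <= vm_last_seen
    timeout_lapsed = now - condition_since >= FAST_FAIL_TIMEOUT
    return vm_gone and kubelet_silent and timeout_lapsed
```

Requiring all three signals keeps a transient API glitch alone from failing a healthy machine.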

This should help to recover much faster in the future. WDYT?

How to reproduce it:

Terminate a running VM instance backing a machine resource, or create a new machine resource (e.g. via scale-out) but block the new VM instance from coming up (e.g. terminate it immediately). You will see that machineHealthTimeout/machineCreationTimeout have to lapse before anything happens, even though the machines are in a terminal state from which they will never recover and the kubelets no longer report in.

while true; do
  sleep 0.1
  aws ec2 describe-instances \
    --filters Name=tag-key,Values=kubernetes.io/cluster/shoot--project--cluster \
              Name=instance-state-code,Values=0 \
              Name=availability-zone,Values=eu-central-1c \
    --no-paginate --output json \
  | jq -r '.Reservations[].Instances[].InstanceId' \
  | tr '\n' ' ' \
  | xargs -I {} sh -c 'echo Terminating instances: {}; aws ec2 terminate-instances --instance-ids {} | cat'
done
@vlerenc vlerenc added kind/enhancement Enhancement, improvement, extension priority/backlog Issues which needs implementation priority/3 Priority (lower number equals higher priority) labels Oct 24, 2022
@sskrustev

Hello Vedran Lerenc,

This is a kind reminder. Are there any updates here?

Kind regards,
Stanimir

@elankath
Contributor

elankath commented Feb 24, 2023

Grooming Results

  • Once the machine phase has moved to Unknown, we should check the error code of the Driver.GetMachineStatus call to see whether the VM instance is gone.
  • If so, we should not wait for the health-check timeout before moving the machine phase to Failed.
  • Instead of performing the GetMachineStatus call only for an Unknown machine, also consider checking the VM instance status regularly (say, every few iterations of reconcileHealthCheck) and moving the machine phase to Failed if the VM is not there. This would obviate the need for the long Running -> Unknown -> Failed progression across consecutive invocations of the health-check reconciliation.
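The grooming proposal above could be sketched roughly as follows. The driver interface, phase strings, and probe cadence are illustrative stand-ins for Driver.GetMachineStatus and the real machine phases, not MCM's actual code.

```python
# Illustrative sketch of the grooming proposal, not MCM's actual code.
# get_machine_status and the phase strings are assumed stand-ins for
# Driver.GetMachineStatus and the machine phases.

PROBE_EVERY_N_RECONCILES = 3  # probe the provider every few health checks

class NotFound(Exception):
    """Stand-in for the driver's 'VM not found' error code."""

def reconcile_health_check(machine: dict, driver, iteration: int) -> str:
    """One health-check reconciliation; returns the resulting machine phase."""
    # Probe the provider regularly instead of only after phase Unknown,
    # so a gone VM skips the long Running -> Unknown -> Failed path.
    if iteration % PROBE_EVERY_N_RECONCILES == 0:
        try:
            driver.get_machine_status(machine)
        except NotFound:
            return "Failed"  # do not wait for machineHealthTimeout
    return machine["phase"]
```

Probing only every few iterations keeps the extra provider API calls bounded while still failing a gone VM within a couple of reconcile cycles.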

@himanshu-kun
Contributor

himanshu-kun commented Feb 24, 2023

One possible solution is to rely on the CCM, as it deletes the node object if the VM is gone. Then we would not have to check the VM on the infrastructure, at least in the machineHealthTimeout case. But the node can also be explicitly deleted by an operator/user, so we would need to verify that the node object does not come back; if it does, checking the VM status would be the safest thing to do.

@himanshu-kun himanshu-kun added priority/2 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers area/robustness Robustness, reliability, resilience related and removed priority/3 Priority (lower number equals higher priority) labels Feb 24, 2023
@vlerenc
Member Author

vlerenc commented Apr 5, 2023

we should check the the error code of Driver.GetMachineStatus call to see if the VM instance has gone.

The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much longer than the 10 minutes we try to shave off here). They show the machine, don't show it, show it again; it takes a while until all API endpoints return the same result. Then again, if the machine is older than 2h (the time window in which we see this issue, e.g. with Azure) and one API endpoint reports it gone, it is probably gone (unless that API endpoint is broken in some other way). We could safeguard this somewhat more and wait to see that result 3x (ignoring other results where the machine is reported present). Still, risky. We simply cannot trust the cloud provider API to return the correct result due to eventual consistency.

@elankath
Contributor

elankath commented Apr 5, 2023

The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much more than the 10 minutes we try to shave off here). They show the machine, don't show it, show it - it takes a while until all API endpoints return the same result.

Agreed. We don't need to take the decision to move the machine to the Failed phase immediately. Our health check runs in every reconcile cycle, so if we continuously confirm the absence of the VM for 3 invocations or ~1 min, the confidence should be sufficient to move the machine to the Failed phase. We should also log (and emit an event) when we do so.
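The consecutive-confirmation idea could be captured by a small counter like the following sketch. The class and threshold are hypothetical, and it follows the stricter variant discussed here (any sighting of the VM resets the count), not necessarily the final design.

```python
# Hedged sketch of the consecutive-confirmation debounce discussed above;
# the class, method, and threshold are illustrative, not MCM's actual API.

CONFIRMATIONS_NEEDED = 3  # ~3 reconcile cycles, about a minute

class AbsenceTracker:
    """Counts consecutive 'VM not found' probe results to absorb the
    eventual-consistency flicker of cloud provider APIs."""

    def __init__(self) -> None:
        self.misses = 0

    def observe(self, vm_found: bool) -> bool:
        """Record one probe result; return True once the absence has been
        confirmed often enough in a row to fail the machine."""
        if vm_found:
            self.misses = 0  # any sighting resets the counter
            return False
        self.misses += 1
        return self.misses >= CONFIRMATIONS_NEEDED
```

Resetting on every sighting means a flickering API endpoint never accumulates enough misses, at the cost of reacting slower when the API alternates between stale and fresh data.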

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 13, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 21, 2024
@ashwani2k
Contributor

ashwani2k commented Oct 16, 2024

While implementing this check, we should not take early action if automatic recovery (https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-ec2-default-automatic-recovery/) is already underway for a failed instance, as it is enabled by default in our usage.
However, if the VM is gone with an orphaned node and machine, then we should not wait for the health timeout and should replace the machine.
