
Erratic serialized drain if there are a large number of volumes attached per node #468

Closed
amshuman-kr opened this issue Jun 8, 2020 · 8 comments
Labels
area/storage (Storage related), effort/1w (Effort for issue is around 1 week), kind/bug (Bug), lifecycle/rotten (Nobody worked on this for 12 months, final aging stage), priority/5 (Priority; lower number equals higher priority), status/closed (Issue is closed, either delivered or triaged)

Comments

@amshuman-kr

amshuman-kr commented Jun 8, 2020

What happened:
In a provider like GCP or Azure (where a relatively large number of volumes can be attached per node), serialized eviction of pods with volumes while draining a node shows erratic behaviour. Most pod evictions (and the corresponding volume detachments) take between 4s and 15s. But if a large number of volumes (>= 40) are attached to a node, sometimes (unpredictably) a bunch of pods are deleted (and their corresponding volumes detached) within 5ms-10ms.

Though the drain logic thinks that the pods' volumes are detached within milliseconds, in reality these volumes are not fully detached, and this causes disproportionate delays in attaching the volumes and starting the replacement pods.

What you expected to happen:
The serialized eviction of pods should proceed normally irrespective of the number of pods with volumes per node.

How to reproduce it (as minimally and precisely as possible):
Steps:

  1. Choose a Kubernetes cluster with nodes hosted in GCP
  2. Deploy a large number of pods with volumes (>=40) onto a single node, e.g. using a combination of nodeAffinity, taints and tolerations (see the sketch after these steps).
  3. Delete the MCM Machine object backing the node on which the pods are hosted.
  4. Monitor the pod status, node status (especially, node.Status.VolumesAttached) and MCM logs.
  5. For the most part, the serialized eviction proceeds as designed, with an interval of anywhere between 4s and 15s per pod with a volume. But sometimes a bunch of pods are evicted and their volumes detached within milliseconds. This happens rarely and unpredictably. The erratic behaviour can be reproduced more reliably with an even larger number of pods with volumes (50 or more) per node. I have never seen it happen with <=20 volumes per node.
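
As a rough illustration of step 2, the sketch below constructs a pod spec pinned to one node via nodeAffinity and tolerating a hypothetical dedicated=drain-test taint; the node name, taint key/value, image and PVC name are all assumptions for the example, not part of the original report.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPinnedPod builds a pod with a PVC-backed volume that is pinned to a
// single node via nodeAffinity and tolerates a dedicated=drain-test taint,
// so that many such pods (each with its own volume) land on the same node.
func buildPinnedPod(name, nodeName, pvcName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Affinity: &corev1.Affinity{
				NodeAffinity: &corev1.NodeAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
						NodeSelectorTerms: []corev1.NodeSelectorTerm{{
							MatchFields: []corev1.NodeSelectorRequirement{{
								Key:      "metadata.name",
								Operator: corev1.NodeSelectorOpIn,
								Values:   []string{nodeName},
							}},
						}},
					},
				},
			},
			Tolerations: []corev1.Toleration{{
				Key:      "dedicated",
				Operator: corev1.TolerationOpEqual,
				Value:    "drain-test",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			Containers: []corev1.Container{{
				Name:         "pause",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}

func main() {
	// One of e.g. 40+ such pods, each with its own PVC, targeting the same node.
	pod := buildPinnedPod("vol-pod-1", "my-node-1", "pvc-1")
	fmt.Println(pod.Name)
}
```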

Anything else we need to know:
MCM watches node.Status.VolumesAttached to check whether a volume has been detached after the corresponding pod has been evicted. But I have noticed inconsistency in the updates to node.Status.VolumesAttached when a large number of volumes are attached per node. Sometimes, after eviction of the pod, the corresponding volume gets removed too quickly from node.Status.VolumesAttached, but then it reappears in the array, only to disappear again. Sometimes it even makes a few such disappearances and reappearances before going away for good. In this case, MCM considers the volume to be detached at the first disappearance and moves on to the next pod eviction.
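
For clarity, here is a minimal sketch of the kind of check described above (not MCM's actual drain code): it simply reports whether a volume name still appears in node.Status.VolumesAttached, which is exactly the signal that flaps here. The sample volume name is made up.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// volumeStillAttached reports whether volName still appears in
// node.Status.VolumesAttached. A drain routine that treats the first
// disappearance as "detached" is fooled when the entry transiently
// vanishes and then reappears, as described above.
func volumeStillAttached(node *corev1.Node, volName corev1.UniqueVolumeName) bool {
	for _, attached := range node.Status.VolumesAttached {
		if attached.Name == volName {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical node status with a single attached CSI volume.
	vol := corev1.UniqueVolumeName("kubernetes.io/csi/pd.csi.storage.gke.io^disk-1")
	node := &corev1.Node{
		Status: corev1.NodeStatus{
			VolumesAttached: []corev1.AttachedVolume{{Name: vol}},
		},
	}
	fmt.Println(volumeStillAttached(node, vol)) // true
}
```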

Environment:
provider: GCP or Azure

Approaches for resolution:

  1. Identify the race condition in upstream kubernetes or cloud provider controllers and contribute a fix there.
  2. Add an additional timeout in the MCM drain logic to check that a volume reported as detached stays detached in the node status.
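
A minimal sketch of what approach 2 might look like, under the assumption of a getNode helper that returns a fresh Node object; the function and parameter names are hypothetical and this is not MCM's actual drain code:

```go
package drainutil

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// WaitUntilStablyDetached illustrates approach 2: after a volume first
// disappears from node.Status.VolumesAttached, keep re-checking for a grace
// period and only report success if it never reappears in that window.
// getNode is assumed to fetch a fresh copy of the Node object.
func WaitUntilStablyDetached(
	getNode func() (*corev1.Node, error),
	volName corev1.UniqueVolumeName,
	grace, interval time.Duration,
) (bool, error) {
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		node, err := getNode()
		if err != nil {
			return false, err
		}
		for _, attached := range node.Status.VolumesAttached {
			if attached.Name == volName {
				// The volume reappeared: it was never really detached.
				return false, nil
			}
		}
		time.Sleep(interval)
	}
	// The volume stayed absent for the whole grace period.
	return true, nil
}
```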
@hardikdr
Member

hardikdr commented Sep 7, 2020

/priority critical

@gardener-robot gardener-robot added the priority/critical Needs to be resolved soon, because it impacts users negatively label Sep 7, 2020
@hardikdr
Member

hardikdr commented Oct 9, 2020

@ggaurav10 do you see any challenge in adding a minor delay after evicting the volume-based pods, to confirm that the detached volume is not flapping but gone for good? Also, how do you see the approach in general?

@ggaurav10
Contributor

TL;DR:
Generally, the approach looks good.

Just thinking out loud:
In the absence of an upstream fix, I think introducing a configurable delay should help control the eviction. It could even be enabled only when more than a certain number of volumes are attached, so that eviction of nodes with fewer volumes is not slowed down. This would also help in testing once k8s finally fixes the apparent race issue.

Just wondering if MCM should wait for that delay only when it sees that the volume got detached "too quickly" (say, within 1 second).
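
As a rough sketch of the conditional delay being discussed, a predicate like the following could gate the extra wait; the thresholds and names are purely illustrative assumptions, not anything agreed in this thread:

```go
package drainutil

import "time"

// ShouldApplyDetachGrace sketches the idea above: only pay the extra grace
// period when the node carries many volumes and the detach was observed
// suspiciously fast. Both thresholds are illustrative and would presumably
// be configurable.
func ShouldApplyDetachGrace(attachedVolumeCount int, detachObservedAfter time.Duration) bool {
	const (
		volumeCountThreshold = 40              // nodes with fewer volumes are drained as before
		suspiciouslyFast     = 1 * time.Second // "too quickly", as suggested above
	)
	return attachedVolumeCount >= volumeCountThreshold && detachObservedAfter < suspiciouslyFast
}
```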

@hardikdr
Member

We discussed today that we will pick this up later, after the out-of-tree (OOT) provider for Azure is out. cc @AxiomSamarth .

@vlerenc
Member

vlerenc commented Oct 14, 2020

Right, @hardikdr. Now with kupid we can steer where we want to have our ETCDs and how many of them.
/priority normal

@gardener-robot gardener-robot added priority/normal and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Oct 14, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 14, 2020
@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/normal labels Mar 8, 2021
@prashanth26 prashanth26 added priority/5 Priority (lower number equals higher priority) effort/1w Effort for issue is around 1 week and removed priority/3 Priority (lower number equals higher priority) status/new Issue is new and unprocessed labels Jul 21, 2021
@prashanth26
Contributor

To be fixed with #621

@prashanth26 prashanth26 added the area/storage Storage related label Jul 21, 2021
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 18, 2022
@elankath
Contributor

This problem is solved now, since in the current drain code we don't just wait for the volume to detach; we also wait for the volume to attach to another node. So even if the volume transiently disappears from oldNode.Status.VolumesAttached, it doesn't matter much, since we also wait until it appears in newNode.Status.VolumesAttached.
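
In other words, the condition waited for looks roughly like the sketch below (illustrative only, not the actual MCM drain code): the volume must have left the old node's status and also appeared in the new node's status.

```go
package drainutil

import corev1 "k8s.io/api/core/v1"

// VolumeMovedToNewNode sketches the condition described above: the volume is
// considered migrated only when it is gone from oldNode.Status.VolumesAttached
// AND present in newNode.Status.VolumesAttached, so a transient disappearance
// on the old node alone is not enough.
func VolumeMovedToNewNode(oldNode, newNode *corev1.Node, volName corev1.UniqueVolumeName) bool {
	contains := func(vols []corev1.AttachedVolume) bool {
		for _, v := range vols {
			if v.Name == volName {
				return true
			}
		}
		return false
	}
	return !contains(oldNode.Status.VolumesAttached) && contains(newNode.Status.VolumesAttached)
}
```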

@himanshu-kun
Contributor

/close as per the explanation given by Tarun above

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Feb 14, 2023