
Erratic serialized drain if there are a large number of volumes attached per node #468

Closed
amshuman-kr opened this issue Jun 8, 2020 · 8 comments
Labels
area/storage (Storage related), effort/1w (Effort for issue is around 1 week), kind/bug (Bug), lifecycle/rotten (Nobody worked on this for 12 months, final aging stage), priority/5 (Priority; lower number equals higher priority), status/closed (Issue is closed, either delivered or triaged)

Comments

@amshuman-kr

amshuman-kr commented Jun 8, 2020

What happened:
In a provider like GCP or Azure (where a relatively large number of volumes can be attached per node), serialized eviction of pods with volumes while draining a node shows erratic behaviour. Most pod evictions (and the corresponding volume detachments) take between 4s and 15s. But if a large number of volumes (>= 40) are attached to a node, sometimes (unpredictably) a bunch of pods are deleted (and their corresponding volumes detached) within 5ms-10ms.

Though the drain logic thinks that the pods' volumes are detached within milliseconds, in reality these volumes are not fully detached, and this causes disproportionate delays in attaching the volumes and starting the replacement pods.

What you expected to happen:
The serialized eviction of pods should proceed normally irrespective of the number of pods with volumes per node.

How to reproduce it (as minimally and precisely as possible):
Steps:

  1. Choose a Kubernetes cluster with nodes hosted in GCP
  2. Deploy a large number of pods with volumes (>=40) onto a single node, e.g. using a combination of nodeAffinity, taints and tolerations (see the sketch after these steps).
  3. Delete the MCM Machine object backing the node on which the pods are hosted.
  4. Monitor the pod status, node status (especially, node.Status.VolumesAttached) and MCM logs.
  5. For the most part, the serialized eviction proceeds as designed, with an interval of anywhere between 4s and 15s per pod with a volume. But sometimes a bunch of pods are evicted and their volumes detached within milliseconds. This happens rarely and unpredictably. The erratic behaviour can be reproduced more reliably with an even larger number of pods with volumes (50 or more) per node. I have never seen it happen with <=20 volumes per node.
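
As a rough illustration of step 2, the sketch below constructs a pod spec pinned to one node via nodeAffinity and tolerating a hypothetical dedicated=drain-test taint; the node name, taint key/value, image and PVC name are all assumptions for the example, not part of the original report.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPinnedPod builds a pod with a PVC-backed volume that is pinned to a
// single node via nodeAffinity and tolerates a dedicated=drain-test taint,
// so that many such pods (each with its own volume) land on the same node.
func buildPinnedPod(name, nodeName, pvcName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Affinity: &corev1.Affinity{
				NodeAffinity: &corev1.NodeAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
						NodeSelectorTerms: []corev1.NodeSelectorTerm{{
							MatchFields: []corev1.NodeSelectorRequirement{{
								Key:      "metadata.name",
								Operator: corev1.NodeSelectorOpIn,
								Values:   []string{nodeName},
							}},
						}},
					},
				},
			},
			Tolerations: []corev1.Toleration{{
				Key:      "dedicated",
				Operator: corev1.TolerationOpEqual,
				Value:    "drain-test",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			Containers: []corev1.Container{{
				Name:         "pause",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}

func main() {
	// One of e.g. 40+ such pods, each with its own PVC, targeting the same node.
	pod := buildPinnedPod("vol-pod-1", "my-node-1", "pvc-1")
	fmt.Println(pod.Name)
}
```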

Anything else we need to know:
MCM watches node.Status.VolumesAttached to check whether a volume has been detached after the corresponding pod has been evicted. But I have noticed inconsistency in the updates to node.Status.VolumesAttached when a large number of volumes are attached per node. Sometimes, after eviction of the pod, the corresponding volume gets removed too quickly from node.Status.VolumesAttached, but then it reappears in the array, only to disappear again. Sometimes it even makes a few such disappearances and reappearances before going away for good. In this case, MCM considers the volume to be detached at the first disappearance and moves on to the next pod eviction.
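
For clarity, here is a minimal sketch of the kind of check described above (not MCM's actual drain code): it simply reports whether a volume name still appears in node.Status.VolumesAttached, which is exactly the signal that flaps here. The sample volume name is made up.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// volumeStillAttached reports whether volName still appears in
// node.Status.VolumesAttached. A drain routine that treats the first
// disappearance as "detached" is fooled when the entry transiently
// vanishes and then reappears, as described above.
func volumeStillAttached(node *corev1.Node, volName corev1.UniqueVolumeName) bool {
	for _, attached := range node.Status.VolumesAttached {
		if attached.Name == volName {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical node status with a single attached CSI volume.
	vol := corev1.UniqueVolumeName("kubernetes.io/csi/pd.csi.storage.gke.io^disk-1")
	node := &corev1.Node{
		Status: corev1.NodeStatus{
			VolumesAttached: []corev1.AttachedVolume{{Name: vol}},
		},
	}
	fmt.Println(volumeStillAttached(node, vol)) // true
}
```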

Environment:
provider: GCP or Azure

Approaches for resolution:

  1. Identify the race condition in upstream kubernetes or cloud provider controllers and contribute a fix there.
  2. Add an additional timeout in the MCM drain logic to check that a volume reported as detached stays detached in the node status.
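
A minimal sketch of what approach 2 might look like, under the assumption of a getNode helper that returns a fresh Node object; the function and parameter names are hypothetical and this is not MCM's actual drain code:

```go
package drainutil

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// WaitUntilStablyDetached illustrates approach 2: after a volume first
// disappears from node.Status.VolumesAttached, keep re-checking for a grace
// period and only report success if it never reappears in that window.
// getNode is assumed to fetch a fresh copy of the Node object.
func WaitUntilStablyDetached(
	getNode func() (*corev1.Node, error),
	volName corev1.UniqueVolumeName,
	grace, interval time.Duration,
) (bool, error) {
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		node, err := getNode()
		if err != nil {
			return false, err
		}
		for _, attached := range node.Status.VolumesAttached {
			if attached.Name == volName {
				// The volume reappeared: it was never really detached.
				return false, nil
			}
		}
		time.Sleep(interval)
	}
	// The volume stayed absent for the whole grace period.
	return true, nil
}
```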
@hardikdr
Member

hardikdr commented Sep 7, 2020

/priority critical

@gardener-robot gardener-robot added the priority/critical Needs to be resolved soon, because it impacts users negatively label Sep 7, 2020
@hardikdr
Member

hardikdr commented Oct 9, 2020

@ggaurav10 do you see any challenge in adding a minor delay after evicting the volume-based pods, to confirm that the detached volume is not flapping but gone for good? Also, how do you see the approach in general?

@ggaurav10
Contributor

TL;DR:
Generally, the approach looks good.

Just thinking out loud:
In the absence of an upstream fix, I think introducing a configurable delay should help control the eviction. It could even be enabled only when more than a certain number of volumes are attached, so that eviction of nodes with fewer volumes is not slowed down. This would also help in testing once k8s finally fixes the apparent race issue.

Just wondering if MCM should wait for that delay only when it sees that the volume got detached "too quickly" (say, within 1 second).
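
As a rough sketch of the conditional delay being discussed, a predicate like the following could gate the extra wait; the thresholds and names are purely illustrative assumptions, not anything agreed in this thread:

```go
package drainutil

import "time"

// ShouldApplyDetachGrace sketches the idea above: only pay the extra grace
// period when the node carries many volumes and the detach was observed
// suspiciously fast. Both thresholds are illustrative and would presumably
// be configurable.
func ShouldApplyDetachGrace(attachedVolumeCount int, detachObservedAfter time.Duration) bool {
	const (
		volumeCountThreshold = 40              // nodes with fewer volumes are drained as before
		suspiciouslyFast     = 1 * time.Second // "too quickly", as suggested above
	)
	return attachedVolumeCount >= volumeCountThreshold && detachObservedAfter < suspiciouslyFast
}
```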

@hardikdr
Member

We discussed today that we will pick this up later, after the out-of-tree (OOT) provider for Azure is out. cc @AxiomSamarth .

@vlerenc
Member

vlerenc commented Oct 14, 2020

Right, @hardikdr. Now with kupid we can steer where we want to have our ETCDs and how many of them.
/priority normal

@gardener-robot gardener-robot added priority/normal and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Oct 14, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 14, 2020
@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/normal labels Mar 8, 2021
@prashanth26 prashanth26 added priority/5 Priority (lower number equals higher priority) effort/1w Effort for issue is around 1 week and removed priority/3 Priority (lower number equals higher priority) status/new Issue is new and unprocessed labels Jul 21, 2021
@prashanth26
Contributor

To be fixed with #621

@prashanth26 prashanth26 added the area/storage Storage related label Jul 21, 2021
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 18, 2022
@elankath
Contributor

This problem is solved now, since in the current drain code we don't just wait for the volume to detach; we also wait for the volume to attach to another node. So even if the volume transiently disappears from oldNode.Status.VolumesAttached, it doesn't matter much, since we also wait until it appears in newNode.Status.VolumesAttached.
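
In other words, the condition waited for looks roughly like the sketch below (illustrative only, not the actual MCM drain code): the volume must have left the old node's status and also appeared in the new node's status.

```go
package drainutil

import corev1 "k8s.io/api/core/v1"

// VolumeMovedToNewNode sketches the condition described above: the volume is
// considered migrated only when it is gone from oldNode.Status.VolumesAttached
// AND present in newNode.Status.VolumesAttached, so a transient disappearance
// on the old node alone is not enough.
func VolumeMovedToNewNode(oldNode, newNode *corev1.Node, volName corev1.UniqueVolumeName) bool {
	contains := func(vols []corev1.AttachedVolume) bool {
		for _, v := range vols {
			if v.Name == volName {
				return true
			}
		}
		return false
	}
	return !contains(oldNode.Status.VolumesAttached) && contains(newNode.Status.VolumesAttached)
}
```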

@himanshu-kun
Contributor

/close as per the explanation given by Tarun above

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Feb 14, 2023