
Optimise Pod Handling in Machine Roll-Out #120

Closed
vlerenc opened this issue Jul 20, 2018 · 8 comments · Fixed by #202
Labels
area/high-availability High availability related
component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.)
kind/discussion Discussion (engaging others in deciding about multiple options)
priority/2 Priority (lower number equals higher priority)

Comments

@vlerenc
Member

vlerenc commented Jul 20, 2018

When we do a rolling update, we don't give Kubernetes any hints to schedule the drained pods on the new machines rather than the old ones, right? This in turn means a rolling update may hit a pod more than once. Is there a chance to "guide" the scheduler into placing the displaced pods on the new rather than the old machines, maybe with some temporary node labels, but ideally without modifying/patching the deployment/pod specs, which would be pretty messy? Maybe the easiest solution could be to actually cordon the old machines the moment the first batch of new machines joined the cluster (I am wary of doing it earlier, because that could mean that pods can't be scheduled anymore until new machines have joined the cluster). What do you think?
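For illustration only, a minimal client-go sketch of that cordon idea, assuming a hypothetical node label machine-set=old that marks the outgoing machines (MCM would identify them differently):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cordonOldNodes marks every node carrying the (hypothetical) old-machine-set
// label as unschedulable, so the scheduler no longer places drained pods there.
func cordonOldNodes(ctx context.Context, clientset kubernetes.Interface, oldSetSelector string) error {
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: oldSetSelector})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if node.Spec.Unschedulable {
			continue // already cordoned
		}
		node.Spec.Unschedulable = true
		if _, err := clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
			return fmt.Errorf("cordoning node %s: %w", node.Name, err)
		}
	}
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	// "machine-set=old" is a made-up selector, used only for this sketch.
	if err := cordonOldNodes(context.Background(), clientset, "machine-set=old"); err != nil {
		panic(err)
	}
}
```

Note that a cordoned node rejects new pods outright, which is exactly why the timing relative to the first new machines joining matters.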

@vlerenc vlerenc added kind/discussion Discussion (engaging others in deciding about multiple options) component/machine-controller-manager area/high-availability High availability related labels Jul 20, 2018
@hardikdr
Member

At the moment we do not ensure that pods evicted during a rolling update do not land back on the old machines.
As you already suggested, it would be nice to cordon all the old-versioned machines once the rolling update starts.

@bergerx

bergerx commented Jul 21, 2018

cordon the old machines the moment the first batch of new machines joined the cluster

This assumes that all users have maxSurge>0, which could be dangerous. But it would still make sense after gardener/gardener#274 is merged, if I understand it correctly.

Another alternative could be applying a PreferNoSchedule taint to the old machines rather than cordoning them all at once. That seems like a safer option which could also cover the maxSurge=0 case, but I'm not entirely sure.
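For comparison, a rough client-go sketch of the taint-based variant; the taint key below is made up, the real one would be whatever MCM chooses:

```go
package rollout

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rollingUpdateTaint is a hypothetical taint used only for this sketch.
var rollingUpdateTaint = corev1.Taint{
	Key:    "node.machine.example.io/rolling-update",
	Effect: corev1.TaintEffectPreferNoSchedule,
}

// taintNodePreferNoSchedule adds the soft taint to a single node, unless it is
// already present. Unlike cordoning, the scheduler may still fall back to this
// node if no other node fits.
func taintNodePreferNoSchedule(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == rollingUpdateTaint.Key && t.Effect == rollingUpdateTaint.Effect {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, rollingUpdateTaint)
	_, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Because PreferNoSchedule is only a soft preference, pods can still land on the old nodes when nothing else fits, which is why it should also be safe for the maxSurge=0 case.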

Also, I thought this would already be a solved problem and that we might benefit from the work of others, but it seems this specific case is not covered by most of the major Kubernetes providers.

It's also worth checking with the related Kubernetes SIGs.

@vlerenc
Member Author

vlerenc commented Jul 21, 2018

Yes, #274 was necessary for the case where the cluster is minimal and all resources are used up, which also means that once it's merged, maxSurge must be at least 1; but it has been at least 1 since the beginning.

We actually hope to find some time later to make the update behaviour more configurable by the end user, who knows the workload best. At the latest when we see very large clusters, that will become a necessity. We worked with BOSH two years ago and know that if you have thousands of machines, you need to work in large waves (significant percentages of the machines) and need hints from the workload owners, otherwise this can't work. At the moment we run with a low maxSurge>0, simply because we respect PDBs but need to terminate uncooperative pods/nodes after a certain grace period, which means that a low number is safe. But theoretically your workload could be slow to start, and then hard limits could cause too many pods to become unavailable at the same time.

Yes, the others have no great solutions either, and yes, PreferNoSchedule is the safer/better way. Thank you!

@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 19, 2018
@vlerenc vlerenc added the priority/critical Needs to be resolved soon, because it impacts users negatively label Oct 4, 2018
@vlerenc
Member Author

vlerenc commented Oct 4, 2018

@amshuman-kr @prashanth26 @hardikdr Can we implement something here to improve the shoot control plane availability (PreferNoSchedule)?

@hardikdr
Member

hardikdr commented Oct 5, 2018

Thanks, PreferNoSchedule sounds good to me. We could basically taint all the old nodes during the rolling update; the scheduler should then avoid placing pods on the old nodes unless no new ones are available (see the sketch below).

  • On a side note, we would then enable MCM to sync taints between machine objects and node objects, which would be a useful function anyway.
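As an illustration only (not the actual implementation), tainting all old nodes at the start of a rolling update could look roughly like this; the label selector and taint key are placeholders:

```go
package rollout

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintOldNodes applies a PreferNoSchedule taint to every node of the outgoing
// machine set, identified here by a placeholder label selector. In MCM this
// would be driven by the rolling-update logic itself; the sketch only shows
// the node-side effect.
func taintOldNodes(ctx context.Context, clientset kubernetes.Interface, oldSetSelector string) error {
	taint := corev1.Taint{
		Key:    "node.machine.example.io/rolling-update", // placeholder key
		Effect: corev1.TaintEffectPreferNoSchedule,
	}
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: oldSetSelector})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		alreadyTainted := false
		for _, t := range node.Spec.Taints {
			if t.Key == taint.Key {
				alreadyTainted = true
				break
			}
		}
		if alreadyTainted {
			continue
		}
		node.Spec.Taints = append(node.Spec.Taints, taint)
		if _, err := clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
			return fmt.Errorf("tainting node %s: %w", node.Name, err)
		}
	}
	return nil
}
```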

@prashanth26
Contributor

prashanth26 commented Oct 5, 2018

I have opened up this issue, which would allow us to specify taints on nodes and thereby help us add taints and tolerations to a node.

After that, we could add the PreferNoSchedule taint while doing a rolling update, as mentioned by @hardikdr above.
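Just to illustrate the pod side (the taint key below is made up): a workload that explicitly tolerates such a taint would not be steered away from the old nodes, since PreferNoSchedule only penalizes pods without a matching toleration:

```go
package rollout

import corev1 "k8s.io/api/core/v1"

// rollingUpdateToleration could be added to a pod spec to opt out of the soft
// repulsion during a machine roll-out; the key is a placeholder for whatever
// taint key gets chosen.
var rollingUpdateToleration = corev1.Toleration{
	Key:      "node.machine.example.io/rolling-update",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectPreferNoSchedule,
}
```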

Thanks & Regards,
Prashanth

@amshuman-kr

amshuman-kr commented Oct 7, 2018

PreferNoSchedule looks good to try.

@prashanth26
Contributor

Hi @ggaurav10 & @hardikdr ,

Could you review this PR when you get time? I have updated the test cases.

@gardener-robot-ci-1 gardener-robot-ci-1 removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 15, 2019
@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021