Add multiple concurrent node reboot feature #660
Conversation
Thanks for your PR.
I haven't done a deep review so far (especially of the lock code).
Currently the logic is able to handle multiple locks at the same time, but all nodes are treated equally. Bigger clusters typically make use of topologies like zones and regions.
Example: You have a six-node cluster split into three zones (two nodes each) and run kured with concurrency 2. It would be possible for two nodes of the same zone to hold the lock concurrently. This would shut down one zone completely (and typically shouldn't happen).
I think kured needs an understanding of topology when it comes to concurrent reboots. I also described this topic in #620.
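For illustration only, here is a minimal sketch of the kind of per-zone cap described above. This is not kured's code: the canAcquire helper, the node-to-zone map, and the maxPerZone parameter are hypothetical; a real implementation would read zones from node labels and keep its state in the existing annotation-based lock.

```go
// A minimal sketch, not kured's code: canAcquire, nodeZones, and maxPerZone
// are hypothetical names used only to illustrate a per-zone reboot cap.
package main

import "fmt"

// canAcquire reports whether candidate may join the current lock holders
// without exceeding maxPerZone concurrent reboots in its zone. nodeZones maps
// node name to the value of its topology.kubernetes.io/zone label.
func canAcquire(holders []string, candidate string, nodeZones map[string]string, maxPerZone int) bool {
	zone := nodeZones[candidate]
	inZone := 0
	for _, h := range holders {
		if nodeZones[h] == zone {
			inZone++
		}
	}
	return inZone < maxPerZone
}

func main() {
	// The six-node, three-zone example from the comment above.
	zones := map[string]string{
		"node-a": "zone-1", "node-b": "zone-1",
		"node-c": "zone-2", "node-d": "zone-2",
		"node-e": "zone-3", "node-f": "zone-3",
	}
	holders := []string{"node-a"} // node-a already holds a lock slot
	fmt.Println(canAcquire(holders, "node-b", zones, 1)) // false: same zone as node-a
	fmt.Println(canAcquire(holders, "node-c", zones, 1)) // true: different zone
}
```

With maxPerZone set to 1, the six-node example above could still run with concurrency 2, but never with both lock holders in the same zone.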
Thanks for the review, @ckotzbauer! While I agree that having a smarter way to reboot multiple nodes is a great end-state goal, I think that would be a good phase two of this work. I could imagine that for many users, including the author of #620, this PR would likely fit their requirements without needing additional knowledge of the topology (or that level of control). Another thing to note is that this is not a breaking change. If a user has a requirement for topology intelligence for multi-node reboots, then kured would work the same way as it did prior to this PR, with single-node reboots. So this would be an opt-in feature. I'd love to investigate how we could make this multi-node reboot logic better in a subsequent PR! Curious about your thoughts on dividing this work up into separate initiatives.
Thanks for your adjustments. I did a full review now and I think the code looks good from a logical point of view.
I agree that this implementation fits use cases where you want to reboot a cluster without topologies. It's good that it is designed as an opt-in feature. But we have to add a clear note on this in our README: it is not safe to reboot bigger production clusters with topologies and concurrency enabled. Kured can disrupt critical production environments if used incorrectly, so Kured is no longer "safe to use by default".
Thank you for the comments, @ckotzbauer!
I'm not sure I understand that statement. By default, kured will still only reboot a single node. As a side note, it looks like the end-to-end tests are failing. I'm unable to repro this on my machine; is it possible to rerun the failures? It appears as though the nodes themselves just never recover, which shouldn't be a kured issue (from my understanding).
My statement was a bit inaccurate: I meant that increasing concurrency is unsafe if the cluster uses topologies. Yes, I can restart the tests. This one is a flaky failure, but typically only for one of the jobs.
Thanks for re-running that, @ckotzbauer! It looks like the failures persist. I'm still not seeing any issues locally. I combed through the failure logs and nothing seems to be jumping out at me. The kind nodes go into an unscheduled state, which makes me believe that kured is doing the right thing, but maybe it's a kind/docker issue? If you have a free moment, could you take a look through the failure logs? Thanks in advance for any assistance!
Yes, I can do that. I think I'll find some time over the weekend.
@ckotzbauer, just wanted to follow up on this. Let me know if you had a chance to look at the failures! Thank you!
It seems that all nodes are rebooting at once, @trstringer.
@trstringer Any news on this?
Just writing down something that came to mind for this PR: it could be useful to constrain simultaneous restarts per topology zone (https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone), e.g. if I have a ~100-node cluster that spans 3 AWS AZs, allow 2 node restarts per AZ at a time.
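As a rough illustration of where that zone information would come from, here is a small sketch (again, not kured's code) that buckets nodes by the standard topology.kubernetes.io/zone label. The nodesByZone helper is hypothetical and assumes the usual k8s.io/api and k8s.io/apimachinery types are available.

```go
// A hypothetical helper, not part of kured: groups node names by their
// topology.kubernetes.io/zone label so a per-zone reboot cap could be applied.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodesByZone returns node names keyed by zone; nodes without the label fall
// into the empty-string bucket.
func nodesByZone(nodes []corev1.Node) map[string][]string {
	byZone := map[string][]string{}
	for _, n := range nodes {
		zone := n.Labels["topology.kubernetes.io/zone"]
		byZone[zone] = append(byZone[zone], n.Name)
	}
	return byZone
}

func main() {
	nodes := []corev1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "node-a", Labels: map[string]string{"topology.kubernetes.io/zone": "us-east-1a"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "node-b", Labels: map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}}},
	}
	fmt.Println(nodesByZone(nodes)) // map[us-east-1a:[node-a] us-east-1b:[node-b]]
}
```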
This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).
Remove stale
Currently in kured, a single node can get the lock with Acquire. There could be situations where multiple nodes want the lock, in the event that a cluster can handle multiple nodes being rebooted at once. This adds a side-by-side implementation for a multiple-node lock. Signed-off-by: Thomas Stringer <[email protected]>
Signed-off-by: Christian Kotzbauer <[email protected]>
50dcf3e to 3e8513b
Currently in kured, a single node can get the lock with Acquire. There could be situations where multiple nodes want the lock, in the event that a cluster can handle multiple nodes being rebooted at once. This adds a side-by-side implementation for a multiple-node lock.
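As a rough mental model of the feature described here (not the PR's actual implementation, which stores lock state in a DaemonSet annotation), the sketch below uses a hypothetical in-memory multiOwnerLock type with a maxOwners parameter to show the intended semantics: up to N nodes can hold a reboot slot at once, and further nodes are refused until a slot is released.

```go
// A hypothetical in-memory analogy of a multiple-owner lock; kured's real lock
// is annotation-based, so this only illustrates the slot semantics.
package main

import (
	"fmt"
	"sync"
)

// multiOwnerLock allows up to maxOwners holders at once, mirroring the idea of
// letting several nodes reboot concurrently.
type multiOwnerLock struct {
	mu        sync.Mutex
	maxOwners int
	owners    map[string]bool
}

func newMultiOwnerLock(maxOwners int) *multiOwnerLock {
	return &multiOwnerLock{maxOwners: maxOwners, owners: map[string]bool{}}
}

// Acquire grants a slot to nodeID if one is free or it already holds one.
func (l *multiOwnerLock) Acquire(nodeID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.owners[nodeID] {
		return true
	}
	if len(l.owners) >= l.maxOwners {
		return false
	}
	l.owners[nodeID] = true
	return true
}

// Release frees nodeID's slot.
func (l *multiOwnerLock) Release(nodeID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.owners, nodeID)
}

func main() {
	lock := newMultiOwnerLock(2)
	fmt.Println(lock.Acquire("node-a")) // true
	fmt.Println(lock.Acquire("node-b")) // true
	fmt.Println(lock.Acquire("node-c")) // false: both slots taken
	lock.Release("node-a")
	fmt.Println(lock.Acquire("node-c")) // true after a slot frees up
}
```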
Testing done
Added unit tests. Also ran a manual test with --concurrency=2. I observed that two nodes rebooted at the same time, and the lock annotation reflected the format for a multiple-owner lock.