Implement role re-assignment based on failure domains #268
Thanks! I agree that this is a problem that needs to be fixed, and we should fix it in go-dqlite, which doesn't use the server-side role management support (yet).
I have a concern about the behavior change here. The new logic may promote a newly joining node to voter even if we've already hit the voter target, if it's from a failure domain that's not represented yet and at least one other failure domain is "overpopulated" with voters. After the promotion succeeds, the adjustment logic will notice that there are too many voters and will try to demote one of them. But the demotion logic is not aware of failure domains, so it's possible that we pick a node for demotion that's the only voter in its failure domain, and the end result is no better than what we started with. To fix this, we should use sortCandidates (or a variant) to guide the demotion process for online voters.
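To illustrate the suggestion, here is a minimal sketch of a domain-aware ordering for demotion. It is not the code from this PR or from go-dqlite: the `voter` type and the `sortForDemotion` function are hypothetical names made up for this example, and a real version would presumably build on sortCandidates and the project's own node types.

```go
package main

import (
	"fmt"
	"sort"
)

// voter is a hypothetical, simplified stand-in for an online voter node.
type voter struct {
	ID            uint64
	FailureDomain uint64
}

// sortForDemotion returns a copy of voters ordered so that nodes in the most
// crowded failure domains come first. Demoting from the front of the result
// avoids picking the only voter of a failure domain while another domain
// still has several voters.
func sortForDemotion(voters []voter) []voter {
	// Count how many voters each failure domain currently has.
	perDomain := make(map[uint64]int)
	for _, v := range voters {
		perDomain[v.FailureDomain]++
	}

	// Sort a copy so the caller's slice is left untouched. The stable sort
	// keeps the original relative order among equally crowded domains.
	sorted := append([]voter(nil), voters...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return perDomain[sorted[i].FailureDomain] > perDomain[sorted[j].FailureDomain]
	})
	return sorted
}

func main() {
	voters := []voter{{1, 10}, {2, 20}, {3, 20}, {4, 30}}
	// Domain 20 holds two voters, so its nodes are the first demotion candidates.
	fmt.Println(sortForDemotion(voters)[0].ID) // prints 2
}
```

With this ordering, the sole voters of domains 10 and 30 are only considered for demotion after the nodes in the overpopulated domain 20.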
I did not think about this. You are right. The latest commit adds a domain-aware sorting function to be used during demotion. To reproduce this problematic behavior I used the following test:
This test is not part of the code because it triggers the bouncing of the voters but does not detect it. You need a debugger or more logging to see the voter bouncing.
@cole-miller any thoughts on the latest updates?
@ktsakalozos, sorry for the delay, I've been on PTO. Taking a look at the latest branch now.
Allow node roles to follow failure domains.
If we do not have voters in all failure domains and there is a domain with more than one voter, we should try to start voters in the failure domains that have none.
Fixes: #267
Understandably, this may not be the direction we want to move in, as role assignment has recently moved to the dqlite project.
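As a rough illustration of the condition stated in the description above (some failure domain has no voter while another has more than one), the check could be sketched as follows. The `node` type and the `needsRebalance` function are hypothetical names used only for this example; they are not taken from the PR or from go-dqlite.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a cluster member.
type node struct {
	ID            uint64
	FailureDomain uint64
	Voter         bool
}

// needsRebalance reports whether at least one failure domain has no voter
// while at least one other failure domain has more than one voter.
func needsRebalance(nodes []node) bool {
	// Register every failure domain, including those without any voter.
	votersPerDomain := make(map[uint64]int)
	for _, n := range nodes {
		if _, seen := votersPerDomain[n.FailureDomain]; !seen {
			votersPerDomain[n.FailureDomain] = 0
		}
		if n.Voter {
			votersPerDomain[n.FailureDomain]++
		}
	}

	empty, crowded := false, false
	for _, count := range votersPerDomain {
		if count == 0 {
			empty = true
		}
		if count > 1 {
			crowded = true
		}
	}
	return empty && crowded
}

func main() {
	nodes := []node{
		{ID: 1, FailureDomain: 10, Voter: true},
		{ID: 2, FailureDomain: 10, Voter: true},
		{ID: 3, FailureDomain: 20, Voter: true},
		{ID: 4, FailureDomain: 30, Voter: false},
	}
	// Domain 30 has no voter and domain 10 has two, so a re-assignment is warranted.
	fmt.Println(needsRebalance(nodes)) // prints true
}
```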