Check member failure #92
Added test for agents & coordinators. [ci LONG=1] [ci TESTOPTIONS="-test.run ^TestMemberResilience"]
pkg/util/arangod/agency_health.go
// The function returns nil when all agents are healthy or an error when something is wrong.
func AreAgentsHealthy(ctx context.Context, clients []Agency) error {
	wg := sync.WaitGroup{}
	invalidKey := []string{"does-not-exists-149e97e8-4b81-5664-a8a8-9ba93881d64c"}
Just for the sake of beauty: drop the s in exists?
done
Fine apart from one tiny remark and a talky call. :)
This PR adds a new high-level layer of checks to improve resilience.
It introduces a "Failed" phase for a member. Once a member has reached that phase, there is no hope of recovery and it will be removed. The existing reconciliation rules ensure that a new member is added in its place.
The resilience check goes over all members and looks for signs that a member is dead beyond hope of recovery. If it finds such a member, it then checks whether that member is allowed to be replaced. If both conditions hold, the member phase is set to "Failed", and the reconciler will create a plan to remove it.