Increase raft notify buffer. #6863
Conversation
Codecov Report
@@             Coverage Diff             @@
##            master    #6863      +/-   ##
===========================================
+ Coverage    65.77%    65.81%    +0.04%
===========================================
  Files          435       435
  Lines        52405     52405
===========================================
+ Hits         34470     34492       +22
+ Misses       13798     13779       -19
+ Partials      4137      4134        -3
Continue to review full report at Codecov.
agent/consul/server.go (Outdated)

@@ -722,7 +722,7 @@ func (s *Server) setupRaft() error {
 	}

 	// Set up a channel for reliable leader notifications.
-	raftNotifyCh := make(chan bool, 1)
+	raftNotifyCh := make(chan bool, 1000)
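For context on what this channel does: hashicorp/raft signals leadership changes by sending a bool on the configured Config.NotifyCh, and a monitor loop on the Consul side consumes those notifications. Below is a minimal, self-contained sketch of that pattern with the raft side stubbed out by a goroutine; apart from the raftNotifyCh name taken from the diff, the details are illustrative assumptions, not the actual Consul code.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Buffered channel for leadership notifications, mirroring the diff above.
	raftNotifyCh := make(chan bool, 1000)

	// Stand-in for the raft library: hashicorp/raft sends true when this
	// node gains leadership and false when it loses it, via Config.NotifyCh.
	go func() {
		for _, isLeader := range []bool{true, false, true} {
			raftNotifyCh <- isLeader
			time.Sleep(10 * time.Millisecond)
		}
		close(raftNotifyCh) // only so this toy program terminates
	}()

	// Stand-in for the leadership monitor loop that consumes the channel.
	for isLeader := range raftNotifyCh {
		if isLeader {
			fmt.Println("gained leadership: start leader routines")
		} else {
			fmt.Println("lost leadership: stop leader routines")
		}
	}
}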
🤔 I wonder how to reason about how much buffering "is enough". This is certainly better than just 1 element, but what is the bound that avoids deadlock entirely? I.e. if you have 1000 raft transactions per second and very fast leader flapping, can it still deadlock?
Based on the analysis in the ticket, this limits how many times we can lose leadership and gain it again before we are at risk of deadlock. Both events are sent on this chan, although the send for gaining leadership doesn't block raft.

So to avoid the same deadlock we would need to allow as many leadership changes as can take place in the time raft is unable to service status requests from autopilot. In theory that is unbounded, though: if the server is under heavy CPU load and can't schedule the autopilot or raft goroutines often enough, that causes flappy leadership as well as making it hard to reason about how long such a situation could continue.

That said, I think even increasing this a little is probably enough to significantly reduce the chance of this deadlock, while the real fix would be to allow raft status reads to time out and/or not block on the raft loop at all, as in hashicorp/raft#356.

So how about making this 10 for now and updating the comment with a link to this PR?
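To make the failure mode above concrete, here is a self-contained toy of the scenario (buffer size, timings and the deliberately missing reader are all illustrative assumptions, not Consul code): the consumer never reads, standing in for autopilot blocked on a raft status call; leadership keeps flapping; and once the buffer is full the next send would block forever, which is where the real raft loop wedges.

package main

import (
	"fmt"
	"time"
)

func main() {
	const bufSize = 10 // the proposed buffer; any bound only delays the problem
	notifyCh := make(chan bool, bufSize)

	// Intentionally no receiver goroutine: stands in for autopilot blocked
	// waiting on a raft status call that the raft main loop cannot answer.

	isLeader := false
	for i := 0; ; i++ {
		isLeader = !isLeader // leadership flaps back and forth
		select {
		case notifyCh <- isLeader:
			fmt.Printf("flap %d: notification buffered (%d/%d slots used)\n",
				i+1, len(notifyCh), cap(notifyCh))
		case <-time.After(100 * time.Millisecond):
			// In the real code this send has no timeout, so the raft loop
			// would block here forever: the deadlock.
			fmt.Printf("flap %d: buffer full, send would block\n", i+1)
			return
		}
	}
}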
Yes, I think it still can, because the buffer could still fill up and block. I think we could merge this PR and then add another one for aggressively reading that chan. Or I can add it here.
Sorry I proposed this last year and never submitted :(
Sounds good @banks! I will make the changes and will also head over to the stats issues and propose a solution.
Thanks Hans!
Fixes #6852.
Increasing the buffer helps recovery from leader flapping. It lowers the chances that a flapping leader gets into a deadlock situation like the one described in #6852.