Increase raft notify buffer. #6863

Merged
@hanshasselberg merged 2 commits into master from increase_buffer on Jan 22, 2020

Conversation

@hanshasselberg (Member) commented on Dec 2, 2019

Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.

* [ ] test in an actual cluster
codecov bot commented on Dec 2, 2019

Codecov Report

Merging #6863 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6863      +/-   ##
==========================================
+ Coverage   65.77%   65.81%   +0.04%     
==========================================
  Files         435      435              
  Lines       52405    52405              
==========================================
+ Hits        34470    34492      +22     
+ Misses      13798    13779      -19     
+ Partials     4137     4134       -3
Impacted Files Coverage Δ
agent/consul/server.go 56.19% <100%> (ø) ⬆️
agent/consul/flood.go 90.9% <0%> (-6.07%) ⬇️
agent/consul/raft_rpc.go 78.72% <0%> (-4.26%) ⬇️
api/lock.go 81.32% <0%> (-1.81%) ⬇️
agent/consul/connect_ca_endpoint.go 56.52% <0%> (-0.67%) ⬇️
agent/consul/state/catalog.go 71.03% <0%> (-0.17%) ⬇️
api/watch/funcs.go 75.64% <0%> (ø) ⬆️
command/debug/debug.go 66.56% <0%> (ø) ⬆️
agent/consul/leader.go 69.74% <0%> (+0.25%) ⬆️
api/semaphore.go 81.03% <0%> (+0.86%) ⬆️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aed5cb7...bdef176.

@hanshasselberg requested a review from a team on December 6, 2019
@@ -722,7 +722,7 @@ func (s *Server) setupRaft() error {
 	}
 
 	// Set up a channel for reliable leader notifications.
-	raftNotifyCh := make(chan bool, 1)
+	raftNotifyCh := make(chan bool, 1000)
A Member commented on the diff:

🤔 I wonder how to reason about how much buffering "is enough". This is certainly better than just 1 element, but what is the bound needed to avoid deadlock entirely? I.e. if you have 1000 raft transactions per second and very fast leader flapping, can it still deadlock?
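
As an aside, here is a minimal standalone Go sketch (toy code, not Consul's) of the blocking behaviour the question is about: with a small buffer, the producer, which stands in for the raft loop, blocks on its send as soon as the consumer stops draining the channel.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Same shape as raftNotifyCh before this PR: a bool channel with a one-slot buffer.
	notifyCh := make(chan bool, 1)

	// Producer stands in for the raft loop: each leadership flap publishes two
	// notifications, lost then gained.
	go func() {
		for flap := 0; ; flap++ {
			notifyCh <- false // lost leadership
			notifyCh <- true  // gained leadership; this send blocks once the buffer is full
			fmt.Printf("flap %d fully published\n", flap)
		}
	}()

	// Consumer stands in for the leader loop, but it never drains notifyCh,
	// as if it were itself blocked waiting on a raft status call.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("producer is stuck on the full buffer; no flap was fully published")
}
```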

A Member commented:

Based on the analysis in the ticket, this limits how many times we can lose leadership and gain it again before we are at risk of deadlock. Both events send to this chan, although the send on gaining leadership doesn't block raft.

So to avoid the same deadlock we would need to allow as many leadership changes as can take place in the time raft is unable to service status requests from autopilot. In theory that is unbounded, though: if the server is under heavy CPU load and can't schedule the autopilot or raft goroutines often enough, that causes flappy leadership and also makes it hard to reason about how long such a situation could continue.

That said, I think even increasing this a little is probably enough to significantly reduce the chance of this deadlock, while the real fix would be to allow raft status reading to time out and/or not block on the raft loop at all, as in hashicorp/raft#356.

So how about making this 10 for now and updating the comment with a link to this PR?
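
For illustration, a sketch of roughly what that suggestion could look like; the exact buffer size and comment wording in the merged commits may differ.

```go
// Set up a channel for reliable leader notifications. A small buffer gives
// the raft loop room to publish a few leadership changes without blocking,
// which reduces the risk of the deadlock discussed in hashicorp/consul#6863.
raftNotifyCh := make(chan bool, 10)
```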

@hanshasselberg (Member, author) replied:

Yes, I think it still can, because the buffer could still fill up and block. I think we could merge this PR and then add another one for aggressively reading that chan, or I can add it here.
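
For illustration only, one possible shape of "aggressively reading that chan" is a drain loop that keeps just the most recent leadership state so the sender never stays blocked; drainLatest below is a hypothetical helper, not Consul's actual code.

```go
// drainLatest empties any queued notifications from notifyCh and returns the
// most recent leadership state, falling back to current when the channel is empty.
func drainLatest(notifyCh <-chan bool, current bool) bool {
	for {
		select {
		case isLeader := <-notifyCh:
			current = isLeader // keep only the newest value
		default:
			return current
		}
	}
}
```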

@hanshasselberg self-assigned this on Dec 16, 2019
@banks (Member) left a comment:

Sorry I proposed this last year and never submitted :(


@hanshasselberg (Member, author) commented:

Sounds good @banks! I will make the changes and will also head over to the stats issues and propose a solution.

@hanshasselberg added the waiting-reply label on Jan 13, 2020
@hanshasselberg requested a review from banks on Jan 13, 2020
@stale bot removed the waiting-reply label on Jan 13, 2020
@hanshasselberg added the waiting-reply label on Jan 13, 2020
@banks (Member) left a comment:

Thanks Hans!

@stale bot removed the waiting-reply label on Jan 22, 2020
@hanshasselberg merged commit 7d6ea82 into master on Jan 22, 2020
@hanshasselberg deleted the increase_buffer branch on Jan 22, 2020
Successfully merging this pull request may close these issues: Possible deadlock situation when leader is flapping (#6852).

2 participants