Increase raft notify buffer. #6863

Merged
@hanshasselberg merged 2 commits into master from increase_buffer on Jan 22, 2020

Conversation

@hanshasselberg (Member) commented on Dec 2, 2019

Fixes #6852.

Increasing the buffer helps recovery from leader flapping. It lowers
the chances of the flapping leader getting into a deadlock situation like
the one described in #6852.

* [ ] test in an actual cluster
codecov bot commented on Dec 2, 2019

Codecov Report

Merging #6863 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6863      +/-   ##
==========================================
+ Coverage   65.77%   65.81%   +0.04%     
==========================================
  Files         435      435              
  Lines       52405    52405              
==========================================
+ Hits        34470    34492      +22     
+ Misses      13798    13779      -19     
+ Partials     4137     4134       -3
Impacted Files Coverage Δ
agent/consul/server.go 56.19% <100%> (ø) ⬆️
agent/consul/flood.go 90.9% <0%> (-6.07%) ⬇️
agent/consul/raft_rpc.go 78.72% <0%> (-4.26%) ⬇️
api/lock.go 81.32% <0%> (-1.81%) ⬇️
agent/consul/connect_ca_endpoint.go 56.52% <0%> (-0.67%) ⬇️
agent/consul/state/catalog.go 71.03% <0%> (-0.17%) ⬇️
api/watch/funcs.go 75.64% <0%> (ø) ⬆️
command/debug/debug.go 66.56% <0%> (ø) ⬆️
agent/consul/leader.go 69.74% <0%> (+0.25%) ⬆️
api/semaphore.go 81.03% <0%> (+0.86%) ⬆️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aed5cb7...bdef176.

@hanshasselberg requested a review from a team on December 6, 2019
@@ -722,7 +722,7 @@ func (s *Server) setupRaft() error {
 	}
 
 	// Set up a channel for reliable leader notifications.
-	raftNotifyCh := make(chan bool, 1)
+	raftNotifyCh := make(chan bool, 1000)
A Member commented on the diff:

🤔 I wonder how to reason about how much buffering "is enough". This is certainly better than just 1 element, but what is the bound needed to avoid deadlock entirely? I.e. if you have 1000 raft transactions per second and very fast leader flapping, can it still deadlock?
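
As an aside, here is a minimal standalone Go sketch (toy code, not Consul's) of the blocking behaviour the question is about: with a small buffer, the producer, which stands in for the raft loop, blocks on its send as soon as the consumer stops draining the channel.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Same shape as raftNotifyCh before this PR: a bool channel with a one-slot buffer.
	notifyCh := make(chan bool, 1)

	// Producer stands in for the raft loop: each leadership flap publishes two
	// notifications, lost then gained.
	go func() {
		for flap := 0; ; flap++ {
			notifyCh <- false // lost leadership
			notifyCh <- true  // gained leadership; this send blocks once the buffer is full
			fmt.Printf("flap %d fully published\n", flap)
		}
	}()

	// Consumer stands in for the leader loop, but it never drains notifyCh,
	// as if it were itself blocked waiting on a raft status call.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("producer is stuck on the full buffer; no flap was fully published")
}
```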

A Member commented:

Based on the analysis in the ticket, this limits how many times we can lose leadership and gain it again before we are at risk of deadlock. Both events send to this chan, although the send on gaining leadership doesn't block raft.

So to avoid the same deadlock we would need to allow as many leadership changes as can take place in the time raft is unable to service status requests from autopilot. In theory that is unbounded, though: if the server is under heavy CPU load and can't schedule the autopilot or raft goroutines often enough, that causes flappy leadership and also makes it hard to reason about how long such a situation could continue.

That said, I think even increasing this a little is probably enough to significantly reduce the chance of this deadlock, while the real fix would be to allow raft status reading to time out and/or not block on the raft loop at all, as in hashicorp/raft#356.

So how about making this 10 for now and updating the comment with a link to this PR?
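
For illustration, a sketch of roughly what that suggestion could look like; the exact buffer size and comment wording in the merged commits may differ.

```go
// Set up a channel for reliable leader notifications. A small buffer gives
// the raft loop room to publish a few leadership changes without blocking,
// which reduces the risk of the deadlock discussed in hashicorp/consul#6863.
raftNotifyCh := make(chan bool, 10)
```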

@hanshasselberg (Member, author) replied:

Yes, I think it still can, because the buffer could still fill up and block. I think we could merge this PR and then add another one for aggressively reading that chan, or I can add it here.
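
For illustration only, one possible shape of "aggressively reading that chan" is a drain loop that keeps just the most recent leadership state so the sender never stays blocked; drainLatest below is a hypothetical helper, not Consul's actual code.

```go
// drainLatest empties any queued notifications from notifyCh and returns the
// most recent leadership state, falling back to current when the channel is empty.
func drainLatest(notifyCh <-chan bool, current bool) bool {
	for {
		select {
		case isLeader := <-notifyCh:
			current = isLeader // keep only the newest value
		default:
			return current
		}
	}
}
```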

@hanshasselberg self-assigned this on Dec 16, 2019
@banks (Member) left a comment:

Sorry I proposed this last year and never submitted :(


@hanshasselberg (Member, author) commented:

Sounds good @banks! I will make the changes and will also head over to the stats issues and propose a solution.

@hanshasselberg added the waiting-reply label on Jan 13, 2020
@hanshasselberg requested a review from banks on Jan 13, 2020
@stale bot removed the waiting-reply label on Jan 13, 2020
@hanshasselberg added the waiting-reply label on Jan 13, 2020
@banks (Member) left a comment:

Thanks Hans!

@stale bot removed the waiting-reply label on Jan 22, 2020
@hanshasselberg merged commit 7d6ea82 into master on Jan 22, 2020
@hanshasselberg deleted the increase_buffer branch on Jan 22, 2020
Successfully merging this pull request may close these issues: Possible deadlock situation when leader is flapping (#6852).

2 participants