Stats calls take arbitrarily long when raft is being used heavily. #356

Closed · banks opened this issue Jul 24, 2019 · 7 comments

Comments

banks (Member) commented Jul 24, 2019

In Consul we often see endpoints that need to fetch raft stats, such as /agent/self, hang for seconds or even minutes on server nodes under write load.

It's usually a symptom of some other problem with raft or of resource contention, but looking at the code here:

raft/api.go, lines 995 to 1004 at ff523e1:

    lastLogIndex, lastLogTerm := r.getLastLog()
    lastSnapIndex, lastSnapTerm := r.getLastSnapshot()
    s := map[string]string{
        "state":          r.getState().String(),
        "term":           toString(r.getCurrentTerm()),
        "last_log_index": toString(lastLogIndex),
        "last_log_term":  toString(lastLogTerm),
        "commit_index":   toString(r.getCommitIndex()),
        "applied_index":  toString(r.getLastApplied()),
        "fsm_pending":    toString(uint64(len(r.fsmMutateCh))),

I notice a few ways we could make this better:

  • We call the getLast* helpers a few times, and each call takes an exclusive lock on lastLock. I doubt this lock is highly contended, though.
  • We wait for the configuration, which is fetched through the main raft loop. If raft is busy restoring a snapshot or handling lots of disk IO, this can take a long time. I'm not sure of a simple answer here, but we might get away with caching the config, pushing updates to it on reconfigurations, and then reading it under a lock rather than waiting for the main thread every time.

I suspect the second point is the real root cause of the issues we've observed; a rough sketch of the caching idea follows.
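
For concreteness, here is a minimal sketch of what that cache could look like. The types and names (configCache, cloneConfiguration) are illustrative stand-ins, not the library's actual internals:

    package raftsketch

    import "sync"

    // Illustrative stand-ins for the library's Server and Configuration types.
    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    // configCache holds a copy of the latest configuration that readers can
    // access without going through the main raft loop.
    type configCache struct {
        mu     sync.RWMutex
        latest Configuration
    }

    // set would be called by the main loop whenever the configuration changes.
    func (c *configCache) set(cfg Configuration) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.latest = cloneConfiguration(cfg)
    }

    // get returns a deep copy, so callers like Stats() never block on the
    // leader/follower loop and can't observe later mutations.
    func (c *configCache) get() Configuration {
        c.mu.RLock()
        defer c.mu.RUnlock()
        return cloneConfiguration(c.latest)
    }

    // cloneConfiguration copies the Servers slice into fresh backing storage.
    func cloneConfiguration(cfg Configuration) Configuration {
        out := Configuration{Servers: make([]Server, len(cfg.Servers))}
        copy(out.Servers, cfg.Servers)
        return out
    }

The main loop would call set whenever it applies a configuration change, while Stats() and GetConfiguration() would call get and never have to wait on the loop.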

catsby (Contributor) commented Sep 20, 2019

Hey @banks - I poked at the second point here for a bit, but I hit a wall. The problem I see is that GetConfiguration() and future.Configuration() end up calling Configurations.Clone and/or Configuration.Clone, which both copy the slice of Servers. I'm not sure how we can cache those, since the return values are expected to be deep copies. At least, the tests fail if they are not deep copies, which means we can't cache a single copy and return it multiple times.

Am I missing something obvious, or is there a way to do this that I don't know?
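
To make the deep-copy requirement concrete, here is a tiny self-contained example (using illustrative stand-in types, not the library's own) of why handing out the same cached value twice is unsafe:

    package main

    import "fmt"

    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    func main() {
        cached := Configuration{Servers: []Server{{ID: "a", Address: "10.0.0.1:8300"}}}

        // Copying the struct copies only the slice header; both values still
        // point at the same backing array of Servers.
        first := cached
        second := cached

        first.Servers[0].Address = "changed"
        fmt.Println(second.Servers[0].Address) // prints "changed", not "10.0.0.1:8300"
    }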


banks (Member, Author) commented Sep 20, 2019

@catsby I didn't have a firm plan or a detailed look, but I was thinking you would effectively denormalise that result into one that can be read under a lock outside of the main loop, and then have the main loop update it (under the same lock) whenever it changes, rather than having to wait for the main loop every time it needs to be read. Does that make sense?
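
As a rough illustration, here is another flavour of the earlier sketch: the main loop stores a fresh deep copy of the latest configuration in an atomic.Value on every reconfiguration, and readers load their own copy without ever touching the loop. The node, storeConfiguration and loadConfiguration names are hypothetical:

    package raftsketch

    import "sync/atomic"

    // Same illustrative stand-ins as in the earlier sketch.
    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    type node struct {
        latestConfig atomic.Value // always holds a Configuration
    }

    // Called from the main raft loop on every reconfiguration; stores a
    // fresh deep copy so the loop can keep mutating its own state safely.
    func (n *node) storeConfiguration(cfg Configuration) {
        servers := make([]Server, len(cfg.Servers))
        copy(servers, cfg.Servers)
        n.latestConfig.Store(Configuration{Servers: servers})
    }

    // Safe to call from Stats(), GetConfiguration(), or any other goroutine;
    // it never waits on the leader/follower loop and returns its own copy.
    func (n *node) loadConfiguration() Configuration {
        cfg, _ := n.latestConfig.Load().(Configuration)
        servers := make([]Server, len(cfg.Servers))
        copy(servers, cfg.Servers)
        return Configuration{Servers: servers}
    }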

stale bot commented Nov 19, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation, please let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

stale bot added the waiting-reply label on Nov 19, 2019
catsby (Contributor) commented Nov 19, 2019

Hey stale - yeah still planning on fixing this, thanks

banks (Member, Author) commented Nov 29, 2019

This came up again in a new form today.

In Consul we call this method in a few places, and in one very specific sequence of events it actually caused a deadlock of the whole Consul server: raft was transitioning from leader to follower, so it wasn't servicing the leader loop, but it was blocked trying to notify Consul of the loss of leadership. Meanwhile Consul's Autopilot was running and trying to read raft stats, which blocked because nothing was servicing the leader or follower loops.

The final part that makes this a deadlock is that Consul synchronously blocks until all leader routines are cleaned up before servicing the notify channel to see whether raft changed state.

Allowing the deadlock is a bug in Consul, but in this case it would also be fixed if we find a solution that makes GetConfiguration non-blocking with respect to the leader/follower loop.

hanshasselberg (Member) commented

PR was merged for that issue, closing.
