Stats calls take arbitrarily long when raft is being used heavily. #356

Closed · banks opened this issue Jul 24, 2019 · 7 comments

Comments

banks (Member) commented Jul 24, 2019

In Consul we often see endpoints that need to fetch raft stats, such as /agent/self, hang for seconds or even minutes on server nodes under write load.

It's usually a symptom of some other problem with raft or of resource contention, but looking at the code here:

raft/api.go, lines 995 to 1004 at ff523e1:

    lastLogIndex, lastLogTerm := r.getLastLog()
    lastSnapIndex, lastSnapTerm := r.getLastSnapshot()
    s := map[string]string{
        "state":          r.getState().String(),
        "term":           toString(r.getCurrentTerm()),
        "last_log_index": toString(lastLogIndex),
        "last_log_term":  toString(lastLogTerm),
        "commit_index":   toString(r.getCommitIndex()),
        "applied_index":  toString(r.getLastApplied()),
        "fsm_pending":    toString(uint64(len(r.fsmMutateCh))),

I notice a few ways we could make this better:

  • We call the getLast* helpers a few times, and each call takes an exclusive lock on lastLock. I doubt this lock is highly contended, though.
  • We wait for the configuration, which is fetched through the main raft loop. If raft is busy restoring a snapshot or handling lots of disk IO, this can take a long time. I'm not sure of a simple answer here, but we might get away with caching the config, pushing updates to it on reconfigurations, and then reading it under a lock rather than waiting for the main thread every time.

I suspect the second point is the real root cause of the issues we've observed; a rough sketch of the caching idea follows.
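
For concreteness, here is a minimal sketch of what that cache could look like. The types and names (configCache, cloneConfiguration) are illustrative stand-ins, not the library's actual internals:

    package raftsketch

    import "sync"

    // Illustrative stand-ins for the library's Server and Configuration types.
    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    // configCache holds a copy of the latest configuration that readers can
    // access without going through the main raft loop.
    type configCache struct {
        mu     sync.RWMutex
        latest Configuration
    }

    // set would be called by the main loop whenever the configuration changes.
    func (c *configCache) set(cfg Configuration) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.latest = cloneConfiguration(cfg)
    }

    // get returns a deep copy, so callers like Stats() never block on the
    // leader/follower loop and can't observe later mutations.
    func (c *configCache) get() Configuration {
        c.mu.RLock()
        defer c.mu.RUnlock()
        return cloneConfiguration(c.latest)
    }

    // cloneConfiguration copies the Servers slice into fresh backing storage.
    func cloneConfiguration(cfg Configuration) Configuration {
        out := Configuration{Servers: make([]Server, len(cfg.Servers))}
        copy(out.Servers, cfg.Servers)
        return out
    }

The main loop would call set whenever it applies a configuration change, while Stats() and GetConfiguration() would call get and never have to wait on the loop.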

catsby (Contributor) commented Sep 20, 2019

Hey @banks - I poked at the second point here for a bit, but I hit a wall. The problem I see is that GetConfiguration() and future.Configuration() end up calling Configurations.Clone and/or Configuration.Clone, which both copy the slice of Servers. I'm not sure how we can cache those, since the return values are expected to be deep copies. At least, the tests fail if they are not deep copies, which means we can't cache a single copy and return it multiple times.

Am I missing something obvious, or is there a way to do this that I don't know?
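
To make the deep-copy requirement concrete, here is a tiny self-contained example (using illustrative stand-in types, not the library's own) of why handing out the same cached value twice is unsafe:

    package main

    import "fmt"

    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    func main() {
        cached := Configuration{Servers: []Server{{ID: "a", Address: "10.0.0.1:8300"}}}

        // Copying the struct copies only the slice header; both values still
        // point at the same backing array of Servers.
        first := cached
        second := cached

        first.Servers[0].Address = "changed"
        fmt.Println(second.Servers[0].Address) // prints "changed", not "10.0.0.1:8300"
    }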


banks (Member, Author) commented Sep 20, 2019

@catsby I didn't have a firm plan or a detailed look, but I was thinking you would effectively denormalise that result into one that can be read under a lock outside of the main loop, and then have the main loop update it (under the same lock) whenever it changes, rather than having to wait for the main loop every time it needs to be read. Does that make sense?
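
As a rough illustration, here is another flavour of the earlier sketch: the main loop stores a fresh deep copy of the latest configuration in an atomic.Value on every reconfiguration, and readers load their own copy without ever touching the loop. The node, storeConfiguration and loadConfiguration names are hypothetical:

    package raftsketch

    import "sync/atomic"

    // Same illustrative stand-ins as in the earlier sketch.
    type Server struct {
        ID      string
        Address string
    }

    type Configuration struct {
        Servers []Server
    }

    type node struct {
        latestConfig atomic.Value // always holds a Configuration
    }

    // Called from the main raft loop on every reconfiguration; stores a
    // fresh deep copy so the loop can keep mutating its own state safely.
    func (n *node) storeConfiguration(cfg Configuration) {
        servers := make([]Server, len(cfg.Servers))
        copy(servers, cfg.Servers)
        n.latestConfig.Store(Configuration{Servers: servers})
    }

    // Safe to call from Stats(), GetConfiguration(), or any other goroutine;
    // it never waits on the leader/follower loop and returns its own copy.
    func (n *node) loadConfiguration() Configuration {
        cfg, _ := n.latestConfig.Load().(Configuration)
        servers := make([]Server, len(cfg.Servers))
        copy(servers, cfg.Servers)
        return Configuration{Servers: servers}
    }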

stale bot commented Nov 19, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation, please let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

stale bot added the waiting-reply label on Nov 19, 2019
catsby (Contributor) commented Nov 19, 2019

Hey stale - yeah still planning on fixing this, thanks

banks (Member, Author) commented Nov 29, 2019

This came up again in a new form today.

In Consul we call this method in a few places, and in one very specific sequence of events it actually caused a deadlock of the whole Consul server: raft was transitioning from leader to follower, so it wasn't servicing the leader loop, but it was blocked trying to notify Consul of the loss of leadership. Meanwhile Consul's Autopilot was running and trying to read raft stats, which blocked because nothing was servicing the leader or follower loops.

The final part that makes this a deadlock is that Consul synchronously blocks until all leader routines are cleaned up before servicing the notify channel to see whether raft changed state.

Allowing the deadlock is a bug in Consul, but in this case it would also be fixed if we find a solution that makes GetConfiguration non-blocking with respect to the leader/follower loop.

hanshasselberg (Member) commented

PR was merged for that issue, closing.
