Stats calls take arbitrarily long when raft is being used heavily. #356
Comments
Hey @banks - I poked at the second point here for a bit, but I hit a wall. The problem I see is that […] Am I missing something obvious, or is there a way to do this that I don't know?
@catsby I didn't have a firm plan or detailed look, but I was thinking you would effectively denormalise that result into one that can be read under a lock outside of the main loop, and then have the main loop update it (under lock) when it changes, rather than having to wait for the main loop every time it needs to be read. Does that make sense?
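A minimal sketch of the idea described above, assuming nothing about the actual hashicorp/raft internals (the type and field names here are made up for illustration): the main loop keeps a denormalised copy of the latest configuration under a lock and updates it whenever it changes, so Stats-style readers only take a brief lock instead of round-tripping through the loop.

```go
package configcache

import "sync"

// Configuration is a stand-in for the real raft configuration type.
type Configuration struct {
	Servers []string
}

// configCache holds a denormalised copy of the latest configuration that
// can be read without involving the main leader/follower loop.
type configCache struct {
	mu     sync.RWMutex
	latest Configuration
}

// update is called by the main loop whenever the configuration changes,
// so the cached copy stays current.
func (c *configCache) update(cfg Configuration) {
	c.mu.Lock()
	c.latest = cfg
	c.mu.Unlock()
}

// read is what Stats/GetConfiguration would use in this sketch: a short
// read lock instead of waiting for the main loop to service a request.
func (c *configCache) read() Configuration {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.latest
}
```

The trade-off is that a reader may see a configuration that is momentarily stale, but it never blocks behind whatever else the main loop is doing.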
This came up again in a new form today. In Consul we call this method in a few places, and in one very specific sequence of events it actually caused a deadlock of the whole Consul server: raft was transitioning from leader to follower, so it wasn't servicing the leader loop, but it blocked trying to notify Consul of the loss of leadership. Meanwhile, Consul's Autopilot was running and trying to read raft stats, which blocked because nothing was servicing the leader or follower loops. The final part that makes this a deadlock is that Consul synchronously blocks until all leader routines are cleaned up before servicing the notify channel to see if raft changed state. That's a bug in Consul to allow the deadlock, but in this case it would be fixed if we find a solution for making GetConfiguration non-blocking on the leader/follower loop.
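For callers like Autopilot, one hedged caller-side mitigation (not part of the raft API, not discussed in this thread, and not a fix for the root cause) is to bound how long a stats read can block; `statsWithTimeout` below is a hypothetical helper name.

```go
package statsguard

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// statsWithTimeout bounds how long the caller waits for raft stats. Note
// that if Stats() never returns, the goroutine below is leaked, so this
// only protects the caller; it does not unblock raft's main loop.
func statsWithTimeout(r *raft.Raft, timeout time.Duration) (map[string]string, error) {
	ch := make(chan map[string]string, 1)
	go func() { ch <- r.Stats() }()
	select {
	case s := <-ch:
		return s, nil
	case <-time.After(timeout):
		return nil, fmt.Errorf("timed out reading raft stats after %s", timeout)
	}
}
```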
PR was merged for that issue, closing. |
In Consul we often see any endpoint that needs to fetch raft stats, like /agent/self, hang for seconds or minutes on server nodes under write load. It's usually a symptom of some other problem with raft or resource contention, but looking at the code here:
raft/api.go, lines 995 to 1004 at ff523e1:
I notice a few ways we could make this better:

- Stats calls getLast* a few times, which both take an exclusive lock on lastLock. I doubt this is highly contended though.
- Stats also has to wait on the main leader/follower loop to read the latest configuration (via GetConfiguration), so it can block for as long as that loop is busy (see the sketch below).

I guess the second point is the real root cause during our observed issues.
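A minimal sketch of the blocking pattern behind that second point, assuming nothing about the actual api.go code (all names below are made up): configuration reads are serviced by the main leader/follower loop over a channel, so a reader waits as long as that loop is busy with other work.

```go
package mainloop

import "errors"

// Configuration is a stand-in for the real raft configuration type.
type Configuration struct {
	Servers []string
}

// configFuture is a request that the main loop answers when it gets to it.
type configFuture struct {
	respCh chan Configuration
}

type node struct {
	configReqCh chan configFuture // serviced only by the main loop
	shutdownCh  chan struct{}
}

// getConfiguration is what a Stats-style reader does in this sketch:
// round-trip through the main loop and wait for it to respond.
func (n *node) getConfiguration() (Configuration, error) {
	f := configFuture{respCh: make(chan Configuration, 1)}
	select {
	case n.configReqCh <- f: // blocks until the main loop is free to receive
	case <-n.shutdownCh:
		return Configuration{}, errors.New("raft is shutdown")
	}
	return <-f.respCh, nil
}

// run is the main leader/follower loop: configuration requests only get
// answered in between the rest of its work.
func (n *node) run(current Configuration) {
	for {
		select {
		case f := <-n.configReqCh:
			f.respCh <- current
		case <-n.shutdownCh:
			return
		// ... plus the replication and apply cases that keep this loop
		// busy when raft is being used heavily.
		}
	}
}
```

Compare this with the cached-copy sketch under the comment above: there the main loop publishes the value under a lock, and readers never have to enter this select at all.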