syncutil: watch mutexes for deadlock #66765
Comments
This is a great idea for a time-series metric!!
Let's say that the long-held mutex is causing a single range to be unavailable. Can we find a way to page on the range being unavailable? I'd rather page on some bad symptom than on a mutex being held too long (we could have a lower-priority alert on the mutex being held too long).
It should, if we write the code in the right way. Mutex deadlocks are particularly pernicious. They are a low-level failure that should basically never occur, because their fallout is really hard to control and because they are not recoverable (there isn't even a way to "try to acquire the mutex, but if it takes too long, stop trying"). So whatever control mechanism we have, if it at any point touches the mutex in a blocking way, it itself gets stuck and won't report. You always need to offload the mutex-touching to a goroutine and then wait for that goroutine to indicate success back to some controller goroutine, which is also the pattern I suggested above.

I tend to think that it's not worth trying to let these metrics delve into debugging too much. That creates lots of engineering problems that don't pay off given how unique each deadlock scenario is. The key point is realizing quickly that there is one so that we can react by pulling stacks and cycling the node. RCA comes later, and it seldom matters which replica was hit by the deadlock, as it can usually occur on any replica.
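A minimal sketch of that offloading pattern, with illustrative names (not actual CockroachDB code): the probe goroutine is the only thing that ever blocks on the mutex, and the controller just waits on a channel with a timeout.

```go
package mutexwatch // illustrative package name

import (
	"sync"
	"time"
)

// probeMutex checks whether mu can be locked within timeout, without ever
// blocking the caller on the mutex itself. The probe goroutine may stay
// blocked (and leak) if the mutex is deadlocked, but the controller keeps
// running and can report the problem.
func probeMutex(mu *sync.Mutex, timeout time.Duration) (healthy bool) {
	done := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}
```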
Ack & thanks for the color. Also, maybe the proposed slow-mutex metric will have a high enough signal-to-noise ratio that we can page on it.
Mhm, +1.
We had another incident of a known …
cc @cockroachdb/test-eng
Some related POC code which I used in a repro for a deadlock (the deadlock never reproed) is here. The important bit in the approach taken in that code is that we need to know the stack of the caller when we detect the slow mutex, which isn't always the case:

```go
r.mu.Lock()
if foo() {
	return
}
r.mu.Unlock() // oops: leaked mutex if we returned above
```

For lots of hot mutexes the approach in #106254 is probably too costly, but maybe not, and definitely not for all. For example, for …
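One way to make the acquirer's stack available is a wrapper that records it at `Lock()` time. This is only an illustrative sketch, not the POC's API; the type and method names are made up:

```go
package mutexwatch // illustrative package name

import (
	"runtime"
	"sync"
)

// slowMutex is a hypothetical wrapper that remembers the stack of whoever
// currently holds it, so a watcher that later detects a stuck Lock() can
// report the holder's call site rather than just its own.
type slowMutex struct {
	mu sync.Mutex // the mutex actually being instrumented

	stackMu     sync.Mutex // guards holderStack
	holderStack []byte     // stack recorded by the current holder; nil when unlocked
}

func (m *slowMutex) Lock() {
	m.mu.Lock()
	buf := make([]byte, 8<<10)
	n := runtime.Stack(buf, false) // stack of this goroutine only
	m.stackMu.Lock()
	m.holderStack = buf[:n]
	m.stackMu.Unlock()
}

func (m *slowMutex) Unlock() {
	m.stackMu.Lock()
	m.holderStack = nil
	m.stackMu.Unlock()
	m.mu.Unlock()
}

// HolderStack can be read without blocking on mu, e.g. by a watcher that
// noticed a probe Lock() taking too long.
func (m *slowMutex) HolderStack() []byte {
	m.stackMu.Lock()
	defer m.stackMu.Unlock()
	return m.holderStack
}
```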
Is your feature request related to a problem? Please describe.
We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.
Describe the solution you'd like
Bare bones

- Introduce a gauge `server.slow.mutex` (name open to discussion).
- Instrument `syncutil.{RW,}Mutex` so that each mutex is tracked by a watcher that, at regular intervals, attempts to `go func() { Lock(); Unlock() }()` the mutex. If this takes longer than X (sensible values might depend on the mutex, but a default of 10s is likely a good start), in the sense that `Lock()` hasn't returned within X, increment the gauge. If it then ever manages to actually lock and then unlock, decrement the gauge. (A rough sketch of such a watcher follows below.)
- Add `server.slow.mutex` to the `gaugeZero` list in `cockroach/pkg/server/status/health_check.go` (lines 44 to 46 at `4126527`), so whenever it's nonzero we will print something in the logs.
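Extending the probe pattern from the comment above into the proposed watcher, a rough sketch could look like this, assuming a plain `sync.Mutex` and a plain integer counter for illustration (the real implementation would hook into `syncutil` and the metrics registry):

```go
package mutexwatch // illustrative package name

import (
	"sync"
	"sync/atomic"
	"time"
)

// watchMutex periodically probes mu. If a probe's Lock() does not return
// within slowThreshold, slowGauge is incremented; once that probe finally
// completes (i.e. the mutex became available again), it is decremented.
func watchMutex(mu *sync.Mutex, slowGauge *int64, interval, slowThreshold time.Duration) {
	for range time.Tick(interval) {
		done := make(chan struct{})
		go func() {
			mu.Lock()
			mu.Unlock()
			close(done)
		}()
		select {
		case <-done:
			// Healthy: the probe acquired and released the mutex in time.
		case <-time.After(slowThreshold):
			atomic.AddInt64(slowGauge, 1)
			go func() {
				<-done // may block forever if the mutex never frees up
				atomic.AddInt64(slowGauge, -1)
			}()
		}
	}
}
```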
Enhanced

We could give each mutex a name (static, i.e. one name shared across all `Replica.mu` instances, for example) and maintain, in addition to the above, a family of histograms of the latency of `Lock()` for each name (with somewhat tricky semantics: we want to make sure that if we deadlock on `Lock()`, the histogram still gets populated; we can do something here, tbd). When a mutex deadlocks, the metrics would then show us which one.
We probably don't want to hard-code the metrics names into timeseries names, which is what our internal timeseries would want. So this would be a prometheus-only metric. There is precedent for that in the tenant metrics, so this isn't new.
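As a sketch of what the per-name histograms could look like as a Prometheus-only metric, assuming the standard Prometheus Go client (the metric name, label, and buckets here are invented for illustration, not an existing metric):

```go
package mutexwatch // illustrative package name

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// lockLatency is an invented metric: one histogram per static mutex name
// (e.g. "replica_mu"), exported to Prometheus only rather than to the
// internal timeseries.
var lockLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "syncutil_mutex_lock_latency_seconds",
	Help:    "Latency of Lock() per named mutex.",
	Buckets: prometheus.ExponentialBuckets(1e-6, 4, 12),
}, []string{"name"})

func init() { prometheus.MustRegister(lockLatency) }

// timedLock wraps a Lock call and records its latency. For the deadlock
// case, a watcher would additionally have to observe something (e.g. the
// slowness threshold) for attempts that never return -- one possible
// answer to the "tricky semantics" mentioned above.
func timedLock(name string, lock func()) {
	start := time.Now()
	lock()
	lockLatency.WithLabelValues(name).Observe(time.Since(start).Seconds())
}
```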
Describe alternatives you've considered
There are many variations on the above. We just need to pick one and do it.
Additional context
See #66760
Jira issue: CRDB-8219