
syncutil: watch mutexes for deadlock #66765

Open
tbg opened this issue Jun 23, 2021 · 6 comments
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) N-followup Needs followup. O-postmortem Originated from a Postmortem action item. P-3 Issues/test failures with no fix SLA quality-friday A good issue to work on on Quality Friday T-testeng TestEng Team

Comments

@tbg
Member

tbg commented Jun 23, 2021

Is your feature request related to a problem? Please describe.

We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Describe the solution you'd like

Bare bones

  • add a gauge server.slow.mutex (name open to discussion)
  • augment syncutil.{RW,}Mutex so that each mutex is tracked by a watcher that, at regular intervals, attempts to go func() { Lock(); Unlock() }() the mutex. If this takes longer than X, in the sense that Lock() hasn't returned within X, increment the gauge (sensible values for X might depend on the mutex, but a default of 10s is likely a good start). If the watcher then ever manages to actually lock and unlock, decrement the gauge again. A sketch of such a watcher follows this list.
  • add server.slow.mutex to the trackedMetrics map as a gaugeZero, so that whenever it's nonzero we print something in the logs:

        var trackedMetrics = map[string]threshold{
            // Gauges.
            "ranges.unavailable": gaugeZero,
            // ...
        }
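Roughly, the watcher could look like the following sketch, which uses only the standard library. watchMutex and slowMutexCount are placeholder names, as are the interval and threshold parameters; a real implementation would live in syncutil and feed the proposed server.slow.mutex gauge instead of a bare counter.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "sync/atomic"
        "time"
    )

    // slowMutexCount stands in for the proposed server.slow.mutex gauge.
    var slowMutexCount atomic.Int64

    // watchMutex periodically tries to Lock/Unlock mu from a helper goroutine.
    // If Lock has not returned within threshold, the counter is incremented;
    // once the lock is eventually acquired, it is decremented again. A real
    // deadlock therefore shows up as a persistently nonzero counter.
    func watchMutex(mu *sync.Mutex, interval, threshold time.Duration) {
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for range ticker.C {
                acquired := make(chan struct{})
                go func() {
                    mu.Lock()
                    mu.Unlock()
                    close(acquired)
                }()
                select {
                case <-acquired:
                    // Mutex is healthy; nothing to do.
                case <-time.After(threshold):
                    slowMutexCount.Add(1)
                    <-acquired // Blocks for as long as the mutex stays stuck.
                    slowMutexCount.Add(-1)
                }
            }
        }()
    }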

Enhanced

We could give each mutex a static name (i.e. one name shared across all Replica.mu instances, for example) and maintain, in addition to the above, a family of histograms of Lock() latency for each name. The semantics are somewhat tricky: we want to make sure that the histogram still gets populated if we deadlock on Lock(); we can do something here, tbd.
When a mutex deadlocks, the metrics would then show us which one.
We probably don't want to hard-code the mutex names into timeseries names, which is what our internal timeseries system would require. So this would be a Prometheus-only metric. There is precedent for that in the tenant metrics, so this isn't new.
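A rough sketch of the shape this could take, using the prometheus client library directly. NamedMutex and mutex_lock_wait_seconds are hypothetical names, registry wiring is omitted, and the tricky case above (a Lock() that never returns never observes a sample) is deliberately left unsolved here.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // lockWaitSeconds is a hypothetical Prometheus-only histogram family keyed
    // by static mutex name. Registration with a registry is omitted.
    var lockWaitSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "mutex_lock_wait_seconds",
            Help: "Time spent waiting in Lock(), keyed by static mutex name.",
        },
        []string{"mutex"},
    )

    // NamedMutex carries a static name (e.g. "replica") so that its
    // lock-acquisition latency can be exported per name.
    type NamedMutex struct {
        name string
        mu   sync.Mutex
    }

    func (m *NamedMutex) Lock() {
        start := time.Now()
        m.mu.Lock()
        lockWaitSeconds.WithLabelValues(m.name).Observe(time.Since(start).Seconds())
    }

    func (m *NamedMutex) Unlock() { m.mu.Unlock() }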

Describe alternatives you've considered
There are many variations on the above. We just need to pick one and do it.

Additional context

See #66760

Jira issue: CRDB-8219

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 23, 2021
@joshimhoff
Collaborator

joshimhoff commented Jun 23, 2021

This is a great idea for a time-series metric!!

We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Let's say that the long-held mutex is leading a single range to be unavailable. Can we find a way to page on the range being unavailable? I'd rather page on some bad symptom than on a mutex being held too long (we could have a lower-priority alert on the mutex being held too long). kvprober would eventually hit the problem range, but that takes a long time depending on cluster size, as you know. Although I sometimes wonder if we should just probe a lot, especially in the serverless setting where CRL owns the host cluster rather than the customer. Can we use a mutex being held too long as a hint to initiate probing? Sounds hard / wrong, as then you need to be able to tie the mutex to a range... Would the stuck applied index alert fire in case of #66760? Or would the watch-for-stuck-applied-index code be stuck due to the deadlock also?

@tbg
Member Author

tbg commented Jun 23, 2021

Would the stuck applied index alert fire in case of #66760?

It should, if we write the code in the right way. Mutex deadlocks are particularly pernicious. They are a low-level failure that should basically never occur, their fallout is really hard to control, and they are not recoverable (there isn't even a way to "try to acquire the mutex, but give up if it takes too long"). So whatever control mechanism we have, if it at any point touches the mutex in a blocking way, it itself gets stuck and won't report. You always need to offload the mutex-touching to a goroutine and then wait for that goroutine to indicate success back to some controller goroutine, which is also the pattern I suggested above.
But then a deadlock in the right mutex will prevent you from scraping the prometheus endpoint (to populate the prometheus metrics you touch the replica mutex, oops, though we can fix that too using the same pattern), so the whole idea may not be too useful in practice, at least not if we're primarily doing it for the replica mutexes.
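To make the offloading pattern concrete, here is a minimal sketch; touchWithTimeout is a made-up name, and the point is only that the controller goroutine never blocks on the mutex itself.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "time"
    )

    // touchWithTimeout spawns a helper goroutine to do the blocking
    // Lock/Unlock and waits for its success signal with a timeout. If the
    // mutex is deadlocked, only the helper goroutine stays stuck; the
    // controller can still report the failure.
    func touchWithTimeout(mu *sync.Mutex, timeout time.Duration) bool {
        done := make(chan struct{})
        go func() {
            mu.Lock()
            mu.Unlock()
            close(done)
        }()
        select {
        case <-done:
            return true
        case <-time.After(timeout):
            return false // The helper goroutine leaks until the mutex frees up.
        }
    }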

I tend to think that it's not worth trying to let these metrics delve into debugging too much. This creates lots of engineering problems that don't pay off given how unique each deadlock scenario is. The key point is realizing quickly that there is one so that we can react by pulling stacks & cycling the node. RCA comes later, and it seldom matters which replica was hit by the deadlock as it can usually occur on any replica in general.

@joshimhoff
Collaborator

It should, if we write the code in the right way.

Ack & thanks for the color. Also, maybe the proposed slow-mutex metric will have a high enough signal-to-noise ratio that we can page on it.

I tend to think that it's not worth trying to let these metrics delve into debugging too much.

Mhm +1.

@lunevalex lunevalex added O-postmortem Originated from a Postmortem action item. N-followup Needs followup. labels Jun 23, 2021
@tbg
Member Author

tbg commented Jul 1, 2021

We had another incident of a known Replica.mu deadlock today (user on 21.1.3, fixed in 21.1.4). These are really bad: the node will limp along in unpredictable ways. In the instance today, all raft scheduler goroutines got glued to the mutex, so the node was no longer meaningfully participating in replication. On top of detecting such issues, nodes should consider crashing, at least when some mutexes seem to be deadlocked.

@blathers-crl

blathers-crl bot commented Feb 17, 2023

cc @cockroachdb/test-eng

@erikgrinaker erikgrinaker added the quality-friday A good issue to work on on Quality Friday label Jul 20, 2023
@tbg
Member Author

tbg commented Jul 21, 2023

Some related POC code, which I used in a repro attempt for a deadlock (the deadlock never reproed), is here. The important bit of the approach taken in that code is that we need to know the stack of the caller when we detect the slow mutex, and that stack isn't always still around:

    r.mu.Lock()
    if foo() { return }
    r.mu.Unlock() // oops: leaked mutex if we return above

For lots of hot mutexes the approach in #106254 is probably too costly, though maybe not, and definitely not for all of them. For example, for raftMu it's likely fine, since raftMu is held across IO anyway in the common case.
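As an illustration of why per-Lock instrumentation can be costly on hot mutexes: capturing the holder's stack means every Lock() pays for a runtime.Callers call. This is a sketch under its own assumptions (TrackedMutex is a placeholder name), not the code from the POC or #106254.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "fmt"
        "runtime"
        "strings"
        "sync"
        "sync/atomic"
    )

    // TrackedMutex records the stack of the current holder at Lock() time so
    // that a watcher which later flags the mutex as slow can report who holds
    // it without ever blocking on the mutex itself.
    type TrackedMutex struct {
        mu     sync.Mutex
        holder atomic.Pointer[[]uintptr] // program counters of the current holder
    }

    func (m *TrackedMutex) Lock() {
        m.mu.Lock()
        pcs := make([]uintptr, 32)
        n := runtime.Callers(2, pcs) // skip runtime.Callers and Lock itself
        trimmed := pcs[:n]
        m.holder.Store(&trimmed)
    }

    func (m *TrackedMutex) Unlock() {
        m.holder.Store(nil)
        m.mu.Unlock()
    }

    // HolderStack returns the most recently captured holder stack, or "" if
    // the mutex is not known to be held. Best-effort, but non-blocking.
    func (m *TrackedMutex) HolderStack() string {
        p := m.holder.Load()
        if p == nil {
            return ""
        }
        var b strings.Builder
        frames := runtime.CallersFrames(*p)
        for {
            f, more := frames.Next()
            fmt.Fprintf(&b, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
            if !more {
                break
            }
        }
        return b.String()
    }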

@abarganier abarganier self-assigned this Jul 25, 2023
@srosenberg srosenberg added the P-3 Issues/test failures with no fix SLA label Nov 20, 2023