
syncutil: watch mutexes for deadlock #66765

Open
tbg opened this issue Jun 23, 2021 · 6 comments
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) N-followup Needs followup. O-postmortem Originated from a Postmortem action item. P-3 Issues/test failures with no fix SLA quality-friday A good issue to work on on Quality Friday T-testeng TestEng Team

Comments

@tbg
Member

tbg commented Jun 23, 2021

Is your feature request related to a problem? Please describe.

We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Describe the solution you'd like

Bare bones

  • add a gauge server.slow.mutex (name open to discussion)
  • augment syncutil.{RW,}Mutex so that each mutex is tracked by a watcher that, at regular intervals, attempts to go func() { Lock(); Unlock() }() the mutex. If this takes longer than X, in the sense that Lock() hasn't returned within X, increment the gauge (sensible values for X might depend on the mutex, but a default of 10s is likely a good start). If the watcher then ever manages to actually lock and unlock, decrement the gauge again. A sketch of such a watcher follows this list.
  • add server.slow.mutex to the trackedMetrics map as a gaugeZero, so that whenever it's nonzero we print something in the logs:

        var trackedMetrics = map[string]threshold{
            // Gauges.
            "ranges.unavailable": gaugeZero,
            // ...
        }
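Roughly, the watcher could look like the following sketch, which uses only the standard library. watchMutex and slowMutexCount are placeholder names, as are the interval and threshold parameters; a real implementation would live in syncutil and feed the proposed server.slow.mutex gauge instead of a bare counter.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "sync/atomic"
        "time"
    )

    // slowMutexCount stands in for the proposed server.slow.mutex gauge.
    var slowMutexCount atomic.Int64

    // watchMutex periodically tries to Lock/Unlock mu from a helper goroutine.
    // If Lock has not returned within threshold, the counter is incremented;
    // once the lock is eventually acquired, it is decremented again. A real
    // deadlock therefore shows up as a persistently nonzero counter.
    func watchMutex(mu *sync.Mutex, interval, threshold time.Duration) {
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for range ticker.C {
                acquired := make(chan struct{})
                go func() {
                    mu.Lock()
                    mu.Unlock()
                    close(acquired)
                }()
                select {
                case <-acquired:
                    // Mutex is healthy; nothing to do.
                case <-time.After(threshold):
                    slowMutexCount.Add(1)
                    <-acquired // Blocks for as long as the mutex stays stuck.
                    slowMutexCount.Add(-1)
                }
            }
        }()
    }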

Enhanced

We could give each mutex a static name (i.e. one name shared across all Replica.mu instances, for example) and maintain, in addition to the above, a family of histograms of Lock() latency for each name. The semantics are somewhat tricky: we want to make sure that the histogram still gets populated if we deadlock on Lock(); we can do something here, tbd.
When a mutex deadlocks, the metrics would then show us which one.
We probably don't want to hard-code the mutex names into timeseries names, which is what our internal timeseries system would require. So this would be a Prometheus-only metric. There is precedent for that in the tenant metrics, so this isn't new.
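A rough sketch of the shape this could take, using the prometheus client library directly. NamedMutex and mutex_lock_wait_seconds are hypothetical names, registry wiring is omitted, and the tricky case above (a Lock() that never returns never observes a sample) is deliberately left unsolved here.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // lockWaitSeconds is a hypothetical Prometheus-only histogram family keyed
    // by static mutex name. Registration with a registry is omitted.
    var lockWaitSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "mutex_lock_wait_seconds",
            Help: "Time spent waiting in Lock(), keyed by static mutex name.",
        },
        []string{"mutex"},
    )

    // NamedMutex carries a static name (e.g. "replica") so that its
    // lock-acquisition latency can be exported per name.
    type NamedMutex struct {
        name string
        mu   sync.Mutex
    }

    func (m *NamedMutex) Lock() {
        start := time.Now()
        m.mu.Lock()
        lockWaitSeconds.WithLabelValues(m.name).Observe(time.Since(start).Seconds())
    }

    func (m *NamedMutex) Unlock() { m.mu.Unlock() }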

Describe alternatives you've considered
There are many variations on the above. We just need to pick one and do it.

Additional context

See #66760

Jira issue: CRDB-8219

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 23, 2021
@joshimhoff
Collaborator

joshimhoff commented Jun 23, 2021

This is a great idea for a time-series metric!!

We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Let's say that the long-held mutex is leading a single range to be unavailable. Can we find a way to page on the range being unavailable? I'd rather page on some bad symptom than on a mutex being held too long (we could have a lower-priority alert on the mutex being held too long). kvprober would eventually hit the problem range, but that takes a long time depending on cluster size, as you know. Although I sometimes wonder if we should just probe a lot, especially in the serverless setting where CRL owns the host cluster rather than the customer. Can we use a mutex being held too long as a hint to initiate probing? Sounds hard / wrong, as then you need to be able to tie the mutex to a range... Would the stuck applied index alert fire in case of #66760? Or would the watch-for-stuck-applied-index code be stuck due to the deadlock also?

@tbg
Member Author

tbg commented Jun 23, 2021

Would the stuck applied index alert fire in case of #66760?

It should, if we write the code in the right way. Mutex deadlocks are particularly pernicious. They are a low-level failure that should basically never occur, their fallout is really hard to control, and they are not recoverable (there isn't even a way to "try to acquire the mutex, but give up if it takes too long"). So whatever control mechanism we have, if it at any point touches the mutex in a blocking way, it itself gets stuck and won't report. You always need to offload the mutex-touching to a goroutine and then wait for that goroutine to indicate success back to some controller goroutine, which is also the pattern I suggested above.
But then a deadlock in the right mutex will prevent you from scraping the prometheus endpoint (to populate the prometheus metrics you touch the replica mutex, oops, though we can fix that too using the same pattern), so the whole idea may not be too useful in practice, at least not if we're primarily doing it for the replica mutexes.
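To make the offloading pattern concrete, here is a minimal sketch; touchWithTimeout is a made-up name, and the point is only that the controller goroutine never blocks on the mutex itself.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "sync"
        "time"
    )

    // touchWithTimeout spawns a helper goroutine to do the blocking
    // Lock/Unlock and waits for its success signal with a timeout. If the
    // mutex is deadlocked, only the helper goroutine stays stuck; the
    // controller can still report the failure.
    func touchWithTimeout(mu *sync.Mutex, timeout time.Duration) bool {
        done := make(chan struct{})
        go func() {
            mu.Lock()
            mu.Unlock()
            close(done)
        }()
        select {
        case <-done:
            return true
        case <-time.After(timeout):
            return false // The helper goroutine leaks until the mutex frees up.
        }
    }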

I tend to think that it's not worth trying to let these metrics delve into debugging too much. This creates lots of engineering problems that don't pay off given how unique each deadlock scenario is. The key point is realizing quickly that there is one so that we can react by pulling stacks & cycling the node. RCA comes later, and it seldom matters which replica was hit by the deadlock as it can usually occur on any replica in general.

@joshimhoff
Collaborator

It should, if we write the code in the right way.

Ack & thanks for the color. Also, maybe the proposed slow-mutex metric will have a high enough signal-to-noise ratio that we can page on it.

I tend to think that it's not worth trying to let these metrics delve into debugging too much.

Mhm +1.

@lunevalex lunevalex added O-postmortem Originated from a Postmortem action item. N-followup Needs followup. labels Jun 23, 2021
@tbg
Member Author

tbg commented Jul 1, 2021

We had another incident of a known Replica.mu deadlock today (user on 21.1.3, fixed in 21.1.4). These are really bad: the node will limp along in unpredictable ways. In the instance today, all raft scheduler goroutines got glued to the mutex, so the node was no longer meaningfully participating in replication. On top of detecting such issues, nodes should consider crashing, at least when some mutexes seem to be deadlocked.

@blathers-crl

blathers-crl bot commented Feb 17, 2023

cc @cockroachdb/test-eng

@erikgrinaker erikgrinaker added the quality-friday A good issue to work on on Quality Friday label Jul 20, 2023
@tbg
Member Author

tbg commented Jul 21, 2023

Some related POC code, which I used in a repro attempt for a deadlock (the deadlock never reproed), is here. The important bit of the approach taken in that code is that we need to know the stack of the caller when we detect the slow mutex, and that stack isn't always still around:

    r.mu.Lock()
    if foo() { return }
    r.mu.Unlock() // oops: leaked mutex if we return above

For lots of hot mutexes the approach in #106254 is probably too costly, though maybe not, and definitely not for all of them. For example, for raftMu it's likely fine, since raftMu is held across IO anyway in the common case.
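As an illustration of why per-Lock instrumentation can be costly on hot mutexes: capturing the holder's stack means every Lock() pays for a runtime.Callers call. This is a sketch under its own assumptions (TrackedMutex is a placeholder name), not the code from the POC or #106254.

    package syncwatch // hypothetical package name, for illustration only

    import (
        "fmt"
        "runtime"
        "strings"
        "sync"
        "sync/atomic"
    )

    // TrackedMutex records the stack of the current holder at Lock() time so
    // that a watcher which later flags the mutex as slow can report who holds
    // it without ever blocking on the mutex itself.
    type TrackedMutex struct {
        mu     sync.Mutex
        holder atomic.Pointer[[]uintptr] // program counters of the current holder
    }

    func (m *TrackedMutex) Lock() {
        m.mu.Lock()
        pcs := make([]uintptr, 32)
        n := runtime.Callers(2, pcs) // skip runtime.Callers and Lock itself
        trimmed := pcs[:n]
        m.holder.Store(&trimmed)
    }

    func (m *TrackedMutex) Unlock() {
        m.holder.Store(nil)
        m.mu.Unlock()
    }

    // HolderStack returns the most recently captured holder stack, or "" if
    // the mutex is not known to be held. Best-effort, but non-blocking.
    func (m *TrackedMutex) HolderStack() string {
        p := m.holder.Load()
        if p == nil {
            return ""
        }
        var b strings.Builder
        frames := runtime.CallersFrames(*p)
        for {
            f, more := frames.Next()
            fmt.Fprintf(&b, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
            if !more {
                break
            }
        }
        return b.String()
    }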

@abarganier abarganier self-assigned this Jul 25, 2023
@srosenberg srosenberg added the P-3 Issues/test failures with no fix SLA label Nov 20, 2023