sql: closed session registry causes deadlock in connexecutor #83078

knz · 2022-06-19T10:59:35Z

While working on an unrelated PR, I observed the following behavior: a query running a new statement (under development) did not complete, but there was also no error. In fact, the conn executor had deadlocked (stack trace below).

I troubleshooted this, and I can say for certainty that the closed session registry will always cause a deadlock if a panic is encountered during execution at a txn transition (savepoint, restart, commit etc).

Here is an example of what can happen internally:

say there's some bug in a statement, whereby some memory does not get released properly at the end of an (implicit) txn
this causes the Stop() method from the memory monitor at the end of the execution (finishSQLTxn()) to report "unexpected leftover bytes" as a panic

finishSQLTxn() is recursively called, through the FSM, from txnStateTransitionsApplyWrapper():

ex.mu.Lock()
err ≔ ex.machine.ApplyWithPayload(withStatement(ex.Ctx(), ex.curStmtAST), ev, payload)
ex.mu.Unlock()

however, meanwhile (*connexecutor) run() contains a defer with the call to .serialize(), which also contains:
```
ex.mu.RLock()
defer ex.mu.RUnlock()
```

So if a panic is encountered under ApplyWithPayload() (which is about every possible state change in the FSM), the .serialize() call at the end will deadlock.

What we want to do about this

Obviously a deadlock is not great!

We can either switch over the serialization to a non-locking operation,

or maybe use a defer unlock around ApplyWithPayload.

Example stack trace

Here is the stack trace:

goroutine 1097 [semacquire]:
sync.runtime_SemacquireMutex(0xc00297eae4, 0x0, 0x0)
        GOROOT/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
        GOROOT/src/sync/rwmutex.go:63
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).serialize(0xc00297d900)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:3028 +0xb3
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run.func1(0xc00297d900, {0x24c6da0, 0xc00193f940}, 0xc003dbbd80)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1788 +0x85
panic({0x9987e0, 0xc0005a7248})
        GOROOT/src/runtime/panic.go:1038 +0x25b
github.com/cockroachdb/cockroach/pkg/util/log/logcrash.ReportOrPanic({0x24c6da0, 0xc002c14740}, 0xc00143c000, {0xdcabda, 0x20}, {0xc002e9b2d0, 0x2, 0x2})
        github.com/cockroachdb/cockroach/pkg/util/log/logcrash/crash_reporting.go:378 +0x225
github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).doStop(0xc0007f6cc0, {0x24c6da0, 0xc002c14740}, 0x1)
        github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:426 +0x294
github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).Stop(...)
        github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:406
github.com/cockroachdb/cockroach/pkg/sql.(*txnState).finishSQLTxn(0xc00297d990)
        github.com/cockroachdb/cockroach/pkg/sql/txn_state.go:249 +0x5d
github.com/cockroachdb/cockroach/pkg/sql.cleanupAndFinishOnError({{0x24c6e48, 0xc001b67dd0}, {0x242d7e0, 0xc000bdb4e0}, {0xb07f40, 0xc00297d990}, {0x242d600, 0xc000c8daa0}, {0x79ca60, 0xc000c8dab0}})
        github.com/cockroachdb/cockroach/pkg/sql/conn_fsm.go:510 +0xf3
github.com/cockroachdb/cockroach/pkg/util/fsm.Transitions.apply({0xc000155ad0}, {{0x24c6e48, 0xc001b67dd0}, {0x242d7e0, 0xc000bdb4e0}, {0xb07f40, 0xc00297d990}, {0x242d600, 0xc000c8daa0}, {0x79ca60, ...}})
        github.com/cockroachdb/cockroach/pkg/util/fsm/fsm.go:106 +0x142
github.com/cockroachdb/cockroach/pkg/util/fsm.(*Machine).ApplyWithPayload(...)
        github.com/cockroachdb/cockroach/pkg/util/fsm/fsm.go:135
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper(0xc00297d900, {0x242d600, 0xc000c8daa0}, {0x79ca60, 0xc000c8dab0}, {0x83171f0a0, 0xc00107bbc0}, 0xc)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2825 +0x298
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc00297d900)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2048 +0x17e5
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc00297d900, {0x24c6da0, 0xc00193f940}, 0xc002112900, {0x5400, 0x15000, 0xc002113680, 0x0, 0x0}, 0xc000bffb60)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1801 +0x27a
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000ceb100, {0x24c6da0, 0xc00193f940}, {0xc00297d900}, {0x5400, 0x15000, 0xc002113680, 0x0, 0x0}, 0xc000bffb60)
        github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:750 +0xe5
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc003302a48, 0xc002c41530, {0x24c6da0, 0xc00193f940}, 0xc000bffb60, 0xc000ceb100, 0xc00107b500, {0x25613a8, 0xc00211a6c0}, 0xc0026a7920, ...)

cc @jordanlewis @maryliag for triage.

Jira issue: CRDB-16848

The text was updated successfully, but these errors were encountered:

knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. regression Regression from a release. A-sql-executor SQL txn logic T-sql-execution labels Jun 19, 2022

blathers-crl bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-sql-queries SQL Queries Team labels Jun 19, 2022

knz mentioned this issue Jun 19, 2022

sql: ensure more mutexes get properly unlocked in case of panic #83083

Merged

craig bot closed this as completed in 76cbd88 Jun 22, 2022

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: closed session registry causes deadlock in connexecutor #83078

sql: closed session registry causes deadlock in connexecutor #83078

knz commented Jun 19, 2022 •

edited

Loading

sql: closed session registry causes deadlock in connexecutor #83078

sql: closed session registry causes deadlock in connexecutor #83078

Comments

knz commented Jun 19, 2022 • edited Loading

What we want to do about this

Example stack trace

knz commented Jun 19, 2022 •

edited

Loading