
sql: byte monitor panic when a hard (ungraceful) shutdown catches up with a graceful drain #101297

Open
knz opened this issue Apr 12, 2023 · 2 comments
Labels: A-sql-memmon (SQL memory monitoring), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-sql-queries (SQL Queries Team)


knz commented Apr 12, 2023

Describe the problem

While investigating #100989 I created the following scenario:

  1. A SQL graceful shutdown starts. In the background, this causes the SQL statement stats provider to start persisting txn/stmt stats.
  2. While the above is ongoing, the test triggers a hard stop by quiescing the stopper and canceling the graceful drain context. The logs confirm that this cancels the ongoing stmt/txn persistence queries.
  3. When all stopper tasks have quiesced, the stopper calls this finalizer in the stats provider (from `sql > sqlstats > persistedsqlstats > provider.go`, in `(*PersistedSQLStats).Start`):

```go
stopper.AddCloser(stop.CloserFn(func() {
  s.cfg.InternalExecutorMonitor.Stop(ctx)
}))
```

And this panics:

```
*
* ERROR: a panic has occurred!
* internal SQL executor: unexpected 122880 leftover bytes
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:884
*   | [...repeated from below...]
* Wraps: (2) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/log/logcrash.ReportOrPanic
*   |   github.com/cockroachdb/cockroach/pkg/util/log/logcrash/crash_reporting.go:390
*   | github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).doStop
*   |   github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:592
*   | github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).Stop
*   |   github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:565
*   | github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).Start.func1
*   |   github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:112
*   | github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:532
```
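For illustration, here is a minimal, self-contained sketch of the pattern the assertion enforces. The names (`toyMonitor`, `Grow`, `Shrink`) are hypothetical and not the real `mon.BytesMonitor` API: the monitor counts reserved bytes, and stopping it while a canceled query never released its reservation trips the "unexpected leftover bytes" check.

```go
package main

import "fmt"

// toyMonitor mimics the leak-detection idea: it tracks how many bytes are
// currently reserved and asserts that everything was released on Stop.
// Names and structure are illustrative; the real monitor lives in pkg/util/mon.
type toyMonitor struct {
	used int64
}

func (m *toyMonitor) Grow(n int64)   { m.used += n }
func (m *toyMonitor) Shrink(n int64) { m.used -= n }

// Stop panics if any bytes are still reserved, mirroring the assertion
// reached via doStop in the stack trace above.
func (m *toyMonitor) Stop() {
	if m.used != 0 {
		panic(fmt.Sprintf("internal SQL executor: unexpected %d leftover bytes", m.used))
	}
}

func main() {
	m := &toyMonitor{}
	m.Grow(122880) // a stats-persistence query reserves memory...
	// ...the query is canceled by the hard stop before it can call Shrink...
	m.Stop() // ...so the closer's Stop panics.
}
```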

This bug is relatively hard to trigger because in most cases a server is shut down using either only a graceful drain or only a hard stop, but not both.

`TestServerStartStop` happens to do both, and this is where the problem was caught.

I plan to work around this by changing `mon.Stop` to `mon.EmergencyStop` in that specific case. This papers over the issue and prevents the panic from occurring. However, it is not a good long-term fix, because it disables the leak-detection code, which we would still like to use.
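For reference, the temporary workaround amounts to swapping the call in the closer quoted above, roughly as follows (assuming `EmergencyStop` takes the same context argument as `Stop`):

```go
// Temporary workaround: skip the leftover-bytes assertion on this shutdown path.
stopper.AddCloser(stop.CloserFn(func() {
  s.cfg.InternalExecutorMonitor.EmergencyStop(ctx)
}))
```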

To Reproduce

Take the code from #101296. Remove the last commit, which replaces `mon.Stop` with `mon.EmergencyStop` in `sql > sqlstats > persistedsqlstats > provider.go`, in `(*PersistedSQLStats).Start`. Then stress-test `TestServerStartStop`. Recommended flags:

```
dev test //pkg/ccl/serverccl -f TestServerStartStop --verbose --stress --test-args='-vmodule=bytes_usage=2'
```

Expected behavior

The abrupt termination of ongoing SQL stats queries during shutdown should still result in an orderly cleanup of the memory monitors.
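One shape such cleanup could take, sketched here with purely illustrative names (`acct`, `runStatsFlush`) rather than the real monitor API: tie every reservation's release to the lifetime of the work with `defer`, so a canceled context cannot leave leftover bytes behind.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// acct is a stand-in for a memory reservation; names are illustrative only.
type acct struct{ used int64 }

func (a *acct) grow(n int64)   { a.used += n }
func (a *acct) shrink(n int64) { a.used -= n }

// runStatsFlush ties the release of its reservation to the lifetime of the
// work via defer, so a canceled context cannot skip the release.
func runStatsFlush(ctx context.Context, a *acct) error {
	a.grow(4096)
	defer a.shrink(4096) // released even on cancellation

	select {
	case <-ctx.Done():
		return ctx.Err() // the hard stop canceled the query
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

func main() {
	a := &acct{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the hard stop catching up with the drain
	fmt.Println(runStatsFlush(ctx, a), "leftover:", a.used) // leftover: 0
}
```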

Jira issue: CRDB-26896

@knz knz added the C-bug, T-sql-queries, and A-sql-memmon labels Apr 12, 2023

knz commented Apr 12, 2023

I wonder if this is related to #100761?


knz commented Apr 12, 2023

cc @mgartner for triage

knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
When a hard stop (e.g. the test exits) catches up with an ongoing
graceful drain, some bytes do not get released properly in a memory
monitor. This triggers a panic during shutdown, since the byte monitor
code verifies that all allocated bytes have been released.

This bug is relatively hard to trigger because in most cases,
a server is shut down either only using a graceful drain, or only
using a hard stop, but not both.

`TestServerStartStop` happens to do this and this is where this
problem was caught. We are now tracking it as issue cockroachdb#101297.

Until that issue is fixed, this commit papers over the problem by
removing the assertion in the byte monitor.

Release note: None
knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
97071: sql/importer: wait for any pending ingestion to complete before rolling back r=dt a=dt

This change adds a new method to the job registry, allowing any ingestion
activity related to a given job ID to be reported to the job registry,
including ingestion activity performed by a processor running on this node
on behalf of a job coordinated by some other node.

The registry also exposes a method to query if a given job has indicated
to the registry on a given node that it is engaged in ingestion activity
on that node.

In conjunction, these APIs can allow a job to indicate when and where it
is running ingestion tasks and for inspection of this information by
other tasks, including potentially subsequent retries or phases of the
same job. Implementation of such usages is left to a future change.

This adds a new helper utility that can be used to check if a given job
has indicated that it is currently performing ingestion work on any node
using the job Registry's MarkAsIngesting API.

This is done by running a distSQL flow on all nodes that invokes a new
processor which simply reads this registry API, returning an error if
the given job ID is ingesting and no error if it is not.
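As a rough sketch of the per-node bookkeeping described above — hypothetical names and structure, not the actual `Registry` code — the state could be a reference-counted map keyed by job ID, with the cluster-wide check fanning this query out to every node:

```go
package main

import (
	"fmt"
	"sync"
)

// ingestRegistry is a simplified, hypothetical stand-in for the per-node
// bookkeeping: processors mark themselves as ingesting for a job while
// writing, and other tasks can ask whether that job is still ingesting here.
type ingestRegistry struct {
	mu        sync.Mutex
	ingesting map[int64]int // jobID -> number of active ingesting processors
}

func newIngestRegistry() *ingestRegistry {
	return &ingestRegistry{ingesting: make(map[int64]int)}
}

// markAsIngesting registers an ingesting processor for jobID and returns a
// function that must be called when that processor stops writing.
func (r *ingestRegistry) markAsIngesting(jobID int64) (unmark func()) {
	r.mu.Lock()
	r.ingesting[jobID]++
	r.mu.Unlock()
	return func() {
		r.mu.Lock()
		if r.ingesting[jobID]--; r.ingesting[jobID] == 0 {
			delete(r.ingesting, jobID)
		}
		r.mu.Unlock()
	}
}

// isIngesting reports whether any processor on this node is still ingesting
// on behalf of jobID; the real check runs this on every node via a distSQL flow.
func (r *ingestRegistry) isIngesting(jobID int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.ingesting[jobID] > 0
}

func main() {
	r := newIngestRegistry()
	unmark := r.markAsIngesting(42)
	fmt.Println(r.isIngesting(42)) // true: a processor is still writing
	unmark()
	fmt.Println(r.isIngesting(42)) // false: safe to move to the next phase
}
```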

This utility is intended for use when a job needs to positively verify
that ingestion activity for a given job -- typically itself in a previous
phase or retry -- has ceased before moving on to a new phase or retry.

Typically, a job only moves on to a new phase after the flow running its
ingestion activity returns after every ingesting processor exits. However,
in some rare edge cases it is possible for this not to be the case: a
very slow, delayed ingestion could still be flushing after the coordinator
for the job terminated and the job was adopted on a different node. In
such a case, that new execution of the job, potentially now in a reverting
phase, could race with this delayed ingestion. In the case of a revert,
such a race could lead to incomplete cleanup by the revert, as data may
be added by the delayed ingestion after the reverting phase processes a
given span.

Integration of this new utility into jobs that require it is left to a
future change.

Release note (bug fix): Fix a potential bug whereby a failed or
cancelled IMPORT could in some cases leave some of the imported rows
behind, in the rare event that the writing processes were slow enough
to continue writing after the cleanup process started.

Epic: none.

101296: server: fix a rare deadlock during tenant server shutdown (deflakes TestServerStartStop) r=JeffSwenson a=knz

Two patches in this PR.

----
Fixes #100989.

Prior to this PR, there was a rare race condition when a hard shutdown of the KV node (via `stopper.Stop`) would catch up with a graceful shutdown of a secondary tenant server (e.g. via `ALTER TENANT STOP SERVICE`) and cause a deadlock.

This would likely never happen on a running server, because there a node shutdown/restart waits for the graceful drain to complete, in which case there is no deadlock. The issue is more prevalent in tests, however, and may cause them to time out.

This PR fixes the root cause of the issue. The actual fix is in the 8th commit; all the commits before that are cosmetic or help with debugging.

In addition, the fix also revealed a bug in the handling of memory monitors upon ungraceful termination. This will be tracked in #101297.

----
Fixes #101403.

Prior to this patch, the task responsible for propagating a graceful
drain captured a variable that was supposed to reference the tenant
server, but did so before that variable was assigned.

This created a possible race condition in the unlikely case where the
server startup fails _and_ a graceful drain is requested concurrently.

This patch fixes it by only starting to propagate graceful drains
after the server is fully initialized (but before it starts accepting
clients, so that we don't create a window of time where clients can
connect but graceful drains don't propagate).

This is also achieved by extracting the two shutdown tasks into
separate functions, to clarify the flow of parameters.
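A stripped-down, hypothetical illustration of this kind of race (not the actual server code): the buggy shape starts the drain-propagation goroutine while `srv` is still unassigned, whereas the fixed shape only starts it once initialization has completed.

```go
package main

import (
	"fmt"
	"time"
)

type tenantServer struct{ name string }

func (s *tenantServer) propagateDrain() { fmt.Println("propagating drain to", s.name) }

func main() {
	var srv *tenantServer

	// Buggy shape: the background task captures srv before it is assigned.
	// (This deliberately contains the data race being described; a drain
	// request arriving during startup may observe a nil server.)
	go func() {
		if srv == nil {
			fmt.Println("race: drain task observed a nil server")
			return
		}
		srv.propagateDrain()
	}()

	time.Sleep(time.Millisecond) // simulate startup work
	srv = &tenantServer{name: "tenant"}

	// Fixed shape: only start propagating drains once the server is fully
	// initialized, so the task always sees a valid reference.
	ready := srv
	go ready.propagateDrain()

	time.Sleep(10 * time.Millisecond) // let the goroutines run
}
```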


101373: dev: fix unit test r=rail a=rickystewart

Mock out the call to `UserCacheDir` here.

Epic: CRDB-17165
Release note: None

101387: ci: add script for private roachtest nightly r=rail a=rickystewart

Part of: DEVINFHD-809.
Epic: none
Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
@mgartner mgartner moved this to Bugs to Fix in SQL Queries Jul 24, 2023