
sql: byte monitor panic when a hard (ungraceful) shutdown catches up with a graceful drain #101297

Open
knz opened this issue Apr 12, 2023 · 2 comments
Labels: A-sql-memmon (SQL memory monitoring), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-sql-queries (SQL Queries Team)


knz commented Apr 12, 2023

Describe the problem

While investigating #100989 I created the following scenario:

  1. A SQL graceful shutdown starts. In the background, this causes the SQL statement stats provider to start persisting txn/stmt stats.
  2. While the above is ongoing, the test triggers a hard stop by quiescing the stopper and canceling the graceful drain context. The logs confirm that this cancels the ongoing stmt/txn persistence queries.
  3. When all stopper tasks have quiesced, the stopper calls this finalizer in the stats provider (from `sql > sqlstats > persistedsqlstats > provider.go`, in `(*PersistedSQLStats).Start`):

```go
stopper.AddCloser(stop.CloserFn(func() {
  s.cfg.InternalExecutorMonitor.Stop(ctx)
}))
```

And this panics:

```
*
* ERROR: a panic has occurred!
* internal SQL executor: unexpected 122880 leftover bytes
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:884
*   | [...repeated from below...]
* Wraps: (2) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/log/logcrash.ReportOrPanic
*   |   github.com/cockroachdb/cockroach/pkg/util/log/logcrash/crash_reporting.go:390
*   | github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).doStop
*   |   github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:592
*   | github.com/cockroachdb/cockroach/pkg/util/mon.(*BytesMonitor).Stop
*   |   github.com/cockroachdb/cockroach/pkg/util/mon/bytes_usage.go:565
*   | github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).Start.func1
*   |   github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:112
*   | github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:532
```
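For illustration, here is a minimal, self-contained sketch of the pattern the assertion enforces. The names (`toyMonitor`, `Grow`, `Shrink`) are hypothetical and not the real `mon.BytesMonitor` API: the monitor counts reserved bytes, and stopping it while a canceled query never released its reservation trips the "unexpected leftover bytes" check.

```go
package main

import "fmt"

// toyMonitor mimics the leak-detection idea: it tracks how many bytes are
// currently reserved and asserts that everything was released on Stop.
// Names and structure are illustrative; the real monitor lives in pkg/util/mon.
type toyMonitor struct {
	used int64
}

func (m *toyMonitor) Grow(n int64)   { m.used += n }
func (m *toyMonitor) Shrink(n int64) { m.used -= n }

// Stop panics if any bytes are still reserved, mirroring the assertion
// reached via doStop in the stack trace above.
func (m *toyMonitor) Stop() {
	if m.used != 0 {
		panic(fmt.Sprintf("internal SQL executor: unexpected %d leftover bytes", m.used))
	}
}

func main() {
	m := &toyMonitor{}
	m.Grow(122880) // a stats-persistence query reserves memory...
	// ...the query is canceled by the hard stop before it can call Shrink...
	m.Stop() // ...so the closer's Stop panics.
}
```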

This bug is relatively hard to trigger because in most cases a server is shut down using either only a graceful drain or only a hard stop, but not both.

`TestServerStartStop` happens to do both, and this is where the problem was caught.

I plan to work around this by changing `mon.Stop` to `mon.EmergencyStop` in that specific case. This papers over the issue and prevents the panic from occurring. However, it is not a good long-term fix, because it disables the leak-detection code, which we would still like to use.
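For reference, the temporary workaround amounts to swapping the call in the closer quoted above, roughly as follows (assuming `EmergencyStop` takes the same context argument as `Stop`):

```go
// Temporary workaround: skip the leftover-bytes assertion on this shutdown path.
stopper.AddCloser(stop.CloserFn(func() {
  s.cfg.InternalExecutorMonitor.EmergencyStop(ctx)
}))
```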

To Reproduce

Take the code from #101296. Remove the last commit, which replaces `mon.Stop` with `mon.EmergencyStop` in `sql > sqlstats > persistedsqlstats > provider.go`, in `(*PersistedSQLStats).Start`. Then stress-test `TestServerStartStop`. Recommended flags:

```
dev test //pkg/ccl/serverccl -f TestServerStartStop --verbose --stress --test-args='-vmodule=bytes_usage=2'
```

Expected behavior

The abrupt termination of ongoing SQL stats queries during shutdown should still result in an orderly cleanup of the memory monitors.
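One shape such cleanup could take, sketched here with purely illustrative names (`acct`, `runStatsFlush`) rather than the real monitor API: tie every reservation's release to the lifetime of the work with `defer`, so a canceled context cannot leave leftover bytes behind.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// acct is a stand-in for a memory reservation; names are illustrative only.
type acct struct{ used int64 }

func (a *acct) grow(n int64)   { a.used += n }
func (a *acct) shrink(n int64) { a.used -= n }

// runStatsFlush ties the release of its reservation to the lifetime of the
// work via defer, so a canceled context cannot skip the release.
func runStatsFlush(ctx context.Context, a *acct) error {
	a.grow(4096)
	defer a.shrink(4096) // released even on cancellation

	select {
	case <-ctx.Done():
		return ctx.Err() // the hard stop canceled the query
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

func main() {
	a := &acct{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the hard stop catching up with the drain
	fmt.Println(runStatsFlush(ctx, a), "leftover:", a.used) // leftover: 0
}
```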

Jira issue: CRDB-26896

@knz knz added the C-bug, T-sql-queries, and A-sql-memmon labels Apr 12, 2023

knz commented Apr 12, 2023

I wonder if this is related to #100761?


knz commented Apr 12, 2023

cc @mgartner for triage

knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
When a hard stop (e.g. the test exits) catches up with an ongoing
graceful drain, some bytes do not get released properly in a memory
monitor. This triggers a panic during shutdown, since the byte monitor
code verifies that all allocated bytes have been released.

This bug is relatively hard to trigger because in most cases,
a server is shut down either only using a graceful drain, or only
using a hard stop, but not both.

`TestServerStartStop` happens to do this and this is where this
problem was caught. We are now tracking it as issue cockroachdb#101297.

Until that issue is fixed, this commit papers over the problem by
removing the assertion in the byte monitor.

Release note: None
knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
craig bot pushed a commit that referenced this issue Apr 13, 2023
97071: sql/importer: wait for any pending ingestion to complete before rolling back r=dt a=dt

This change adds a new method to the job registry, allowing any ingestion
activity related to a given job ID to be reported to the job registry,
including ingestion activity performed by a processor running on this node
on behalf of a job coordinated by some other node.

The registry also exposes a method to query if a given job has indicated
to the registry on a given node that it is engaged in ingestion activity
on that node.

In conjunction, these APIs can allow a job to indicate when and where it
is running ingestion tasks and for inspection of this information by
other tasks, including potentially subsequent retries or phases of the
same job. Implementation of such usages is left to a future change.

This adds a new helper utility that can be used to check if a given job
has indicated that it is currently performing ingestion work on any node
using the job Registry's MarkAsIngesting API.

This is done by running a distSQL flow on all nodes that invokes a new
processor which simply reads this registry API, returning an error if
the given job ID is ingesting and no error if it is not.
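As a rough sketch of the per-node bookkeeping described above — hypothetical names and structure, not the actual `Registry` code — the state could be a reference-counted map keyed by job ID, with the cluster-wide check fanning this query out to every node:

```go
package main

import (
	"fmt"
	"sync"
)

// ingestRegistry is a simplified, hypothetical stand-in for the per-node
// bookkeeping: processors mark themselves as ingesting for a job while
// writing, and other tasks can ask whether that job is still ingesting here.
type ingestRegistry struct {
	mu        sync.Mutex
	ingesting map[int64]int // jobID -> number of active ingesting processors
}

func newIngestRegistry() *ingestRegistry {
	return &ingestRegistry{ingesting: make(map[int64]int)}
}

// markAsIngesting registers an ingesting processor for jobID and returns a
// function that must be called when that processor stops writing.
func (r *ingestRegistry) markAsIngesting(jobID int64) (unmark func()) {
	r.mu.Lock()
	r.ingesting[jobID]++
	r.mu.Unlock()
	return func() {
		r.mu.Lock()
		if r.ingesting[jobID]--; r.ingesting[jobID] == 0 {
			delete(r.ingesting, jobID)
		}
		r.mu.Unlock()
	}
}

// isIngesting reports whether any processor on this node is still ingesting
// on behalf of jobID; the real check runs this on every node via a distSQL flow.
func (r *ingestRegistry) isIngesting(jobID int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.ingesting[jobID] > 0
}

func main() {
	r := newIngestRegistry()
	unmark := r.markAsIngesting(42)
	fmt.Println(r.isIngesting(42)) // true: a processor is still writing
	unmark()
	fmt.Println(r.isIngesting(42)) // false: safe to move to the next phase
}
```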

This utility is intended for use when a job needs to positively verify
that ingestion activity for a given job -- typically itself in a previous
phase or retry -- has ceased before moving on to a new phase or retry.

Typically, a job only moves on to a new phase after the flow running its
ingestion activity returns after every ingesting processor exits. However,
in some rare edge cases it is possible for this not to be the case: a
very slow, delayed ingestion could still be flushing after the coordinator
for the job terminated and the job was adopted on a different node. In
such a case, that new execution of the job, potentially now in a reverting
phase, could race with this delayed ingestion. In the case of a revert,
such a race could lead to incomplete cleanup by the revert, as data may
be added by the delayed ingestion after the reverting phase processes a
given span.

Integration of this new utility into jobs that require it is left to a
future change.

Release note (bug fix): Fix a potential bug whereby a failed or
cancelled IMPORT could in some cases leave some of the imported rows
behind, in the rare event that the writing processes were slow enough
to continue writing after the cleanup process started.

Epic: none.

101296: server: fix a rare deadlock during tenant server shutdown (deflakes TestServerStartStop) r=JeffSwenson a=knz

Two patches in this PR.

----
Fixes #100989.

Prior to this PR, there was a rare race condition when a hard shutdown of the KV node (via `stopper.Stop`) would catch up with a graceful shutdown of a secondary tenant server (e.g. via `ALTER TENANT STOP SERVICE`) and cause a deadlock.

This would likely never happen on a running server, because there a node shutdown/restart waits for the graceful drain to complete, in which case there is no deadlock. The issue is more prevalent in tests, however, and may cause them to time out.

This PR fixes the root cause of the issue. The actual fix is in the 8th commit; all the commits before that are cosmetic or help with debugging.

In addition, the fix also revealed a bug in the handling of memory monitors upon ungraceful termination. This will be tracked in #101297.

----
Fixes #101403.

Prior to this patch, the task responsible for propagating a graceful
drain captured a variable that was supposed to reference the tenant
server, but did so before that variable was assigned.

This created a possible race condition in the unlikely case where the
server startup fails _and_ a graceful drain is requested concurrently.

This patch fixes it by only starting to propagate graceful drains
after the server is fully initialized (but before it starts accepting
clients, so that we don't create a window of time where clients can
connect but graceful drains don't propagate).

This is also achieved by extracting the two shutdown tasks into
separate functions, to clarify the flow of parameters.
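A stripped-down, hypothetical illustration of this kind of race (not the actual server code): the buggy shape starts the drain-propagation goroutine while `srv` is still unassigned, whereas the fixed shape only starts it once initialization has completed.

```go
package main

import (
	"fmt"
	"time"
)

type tenantServer struct{ name string }

func (s *tenantServer) propagateDrain() { fmt.Println("propagating drain to", s.name) }

func main() {
	var srv *tenantServer

	// Buggy shape: the background task captures srv before it is assigned.
	// (This deliberately contains the data race being described; a drain
	// request arriving during startup may observe a nil server.)
	go func() {
		if srv == nil {
			fmt.Println("race: drain task observed a nil server")
			return
		}
		srv.propagateDrain()
	}()

	time.Sleep(time.Millisecond) // simulate startup work
	srv = &tenantServer{name: "tenant"}

	// Fixed shape: only start propagating drains once the server is fully
	// initialized, so the task always sees a valid reference.
	ready := srv
	go ready.propagateDrain()

	time.Sleep(10 * time.Millisecond) // let the goroutines run
}
```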


101373: dev: fix unit test r=rail a=rickystewart

Mock out the call to `UserCacheDir` here.

Epic: CRDB-17165
Release note: None

101387: ci: add script for private roachtest nightly r=rail a=rickystewart

Part of: DEVINFHD-809.
Epic: none
Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
knz added a commit to knz/cockroach that referenced this issue Apr 13, 2023
@mgartner mgartner moved this to Bugs to Fix in SQL Queries Jul 24, 2023