sql: byte monitor panic when a hard (ungraceful) shutdown catches up with a graceful drain #101297
Labels
A-sql-memmon
SQL memory monitoring
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-sql-queries
SQL Queries Team
Comments
knz added the C-bug label on Apr 12, 2023
I wonder if this is related to #100761?
cc @mgartner for triage
knz added a commit to knz/cockroach that referenced this issue on Apr 13, 2023:
When a hard stop (e.g. a test exits) catches up with an ongoing graceful drain, some bytes do not get released properly in a memory monitor. This triggers a panic during shutdown, since the byte monitor code verifies that all allocated bytes have been released. This bug is relatively hard to trigger because in most cases, a server is shut down either only using a graceful drain, or only using a hard stop, but not both. `TestServerStartStop` happens to do both, and this is where the problem was caught. We are now tracking it as issue cockroachdb#101297. Until that issue is fixed, this commit papers over the problem by removing the assertion in the byte monitor.

Release note: None
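For illustration, here is a minimal, self-contained sketch of the assertion the commit talks about, assuming a much-simplified monitor; the `bytesMonitor` type and its methods are hypothetical stand-ins, not the actual `mon` package API:

```go
package main

import "fmt"

// bytesMonitor is a hypothetical, stripped-down stand-in for a byte
// accounting monitor; it is not CockroachDB's mon.BytesMonitor.
type bytesMonitor struct {
	name      string
	allocated int64
}

// Grow records an allocation of n bytes.
func (m *bytesMonitor) Grow(n int64) { m.allocated += n }

// Shrink records the release of n bytes.
func (m *bytesMonitor) Shrink(n int64) { m.allocated -= n }

// Stop verifies that every allocated byte was released; this mirrors the
// assertion that panics when a hard stop interrupts a graceful drain.
func (m *bytesMonitor) Stop() {
	if m.allocated != 0 {
		panic(fmt.Sprintf("%s: %d bytes still allocated at shutdown", m.name, m.allocated))
	}
}

// EmergencyStop skips the leak assertion, papering over the problem at the
// cost of losing leak detection.
func (m *bytesMonitor) EmergencyStop() {
	m.allocated = 0
}

func main() {
	m := &bytesMonitor{name: "sql-stats"}
	m.Grow(4096)
	// A hard stop catches up with the drain before the 4096 bytes are
	// shrunk back; Stop would panic here, EmergencyStop does not.
	m.EmergencyStop()
	m.Stop() // safe now: nothing outstanding
}
```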
craig bot pushed a commit that referenced this issue on Apr 13, 2023:
97071: sql/importer: wait for any pending ingestion to complete before rolling back r=dt a=dt

This change adds a new method to the job registry, allowing any ingestion activity related to a given job ID to be indicated to the job registry, including ingestion activity performed by a processor running on this node on behalf of a job coordinated by some other node. The registry also exposes a method to query whether a given job has indicated to the registry on a given node that it is engaged in ingestion activity on that node. In conjunction, these APIs allow a job to indicate when and where it is running ingestion tasks, and allow other tasks, including potentially subsequent retries or phases of the same job, to inspect this information. Implementation of such usages is left to a future change.

This also adds a new helper utility that can be used to check whether a given job has indicated, via the job registry's MarkAsIngesting API, that it is currently performing ingestion work on any node. This is done by running a distSQL flow on all nodes that invokes a new processor which simply reads this registry API, returning an error if the given job ID is ingesting and no error if it is not. This utility is intended for use when a job needs to positively verify that ingestion activity for a given job -- typically itself in a previous phase or retry -- has ceased before moving on to a new phase or retry.

Typically, a job only moves on to a new phase after the flow running its ingestion activity returns, i.e. after every ingesting processor exits. However, in some rare edge cases this may not hold: a very slow, delayed ingestion could still be flushing after the coordinator for the job terminated and the job was adopted on a different node. In such a case, that new execution of the job, potentially now in a reverting phase, could race with this delayed ingestion. In the case of a revert, such a race could lead to incomplete cleanup, as data may be added by the delayed ingestion after the reverting phase processes a given span. Integration of this new utility into the jobs that require it is left to a future change.

Release note (bug fix): Fix a potential bug whereby a failed or cancelled IMPORT could in some cases leave some of the imported rows behind after it was cancelled, in the rare event that the writing processes were slow enough to continue writing after the cleanup process started.

Epic: none.
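As a rough sketch of the per-node bookkeeping described above, assuming a much-simplified registry (the `ingestionRegistry` type and the method signatures here are illustrative, not the actual `jobs.Registry` API):

```go
package main

import (
	"fmt"
	"sync"
)

// ingestionRegistry is a hypothetical sketch of per-node ingestion
// bookkeeping; it is not the actual jobs.Registry implementation.
type ingestionRegistry struct {
	mu        sync.Mutex
	ingesting map[int64]int // jobID -> number of active ingesting processors
}

func newIngestionRegistry() *ingestionRegistry {
	return &ingestionRegistry{ingesting: make(map[int64]int)}
}

// MarkAsIngesting records that a processor on this node is ingesting on
// behalf of jobID, and returns a func that undoes the mark when the
// processor exits.
func (r *ingestionRegistry) MarkAsIngesting(jobID int64) (unmark func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.ingesting[jobID]++
	return func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		if r.ingesting[jobID]--; r.ingesting[jobID] <= 0 {
			delete(r.ingesting, jobID)
		}
	}
}

// IsIngesting reports whether any processor on this node is still ingesting
// for jobID; a checker flow would invoke this on every node and return an
// error if any node answers true.
func (r *ingestionRegistry) IsIngesting(jobID int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.ingesting[jobID] > 0
}

func main() {
	reg := newIngestionRegistry()
	done := reg.MarkAsIngesting(42)
	fmt.Println(reg.IsIngesting(42)) // true: a delayed flush is still running
	done()
	fmt.Println(reg.IsIngesting(42)) // false: safe to move to the next phase
}
```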
101296: server: fix a rare deadlock during tenant server shutdown (deflakes TestServerStartStop) r=JeffSwenson a=knz

Two patches in this PR.

Fixes #100989. Prior to this PR, there was a rare race condition when a hard shutdown of the KV node (via `stopper.Stop`) would catch up with a graceful shutdown of a secondary tenant server (e.g. via `ALTER TENANT STOP SERVICE`) and cause a deadlock. This would likely never happen on a running server, because there a node shutdown/restart waits for the graceful drain to complete, and in that case there is no deadlock. The issue is more prevalent in tests, however, and may cause them to time out. This PR fixes the root cause of the issue. The actual fix is in the 8th commit; all the commits before that are cosmetic or help with debugging. In addition, the fix revealed a bug in the handling of memory monitors upon ungraceful termination. This will be tracked in #101297.

Fixes #101403. Prior to this patch, the task responsible for propagating a graceful drain captured a variable that was supposed to reference the tenant server, but before that variable was assigned. This created a possible race condition in the unlikely case where the server startup fails _and_ a graceful drain is requested, concurrently. This patch fixes it by only starting to propagate graceful drains after the server is fully initialized (but before it starts accepting clients, so that there is no window of time during which clients can connect but graceful drains don't propagate). This is also achieved by extracting the two shutdown tasks into separate functions, to clarify the flow of parameters.

101373: dev: fix unit test r=rail a=rickystewart

Mock out the call to `UserCacheDir` here. Epic: CRDB-17165. Release note: None.

101387: ci: add script for private roachtest nightly r=rail a=rickystewart

Part of: DEVINFHD-809. Epic: none. Release note: None.

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
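A hedged sketch of the drain-propagation race described for #101403 and its fix, using hypothetical names (`tenantServer`, `drainRequested`) rather than the actual server startup code:

```go
package main

import (
	"fmt"
	"sync"
)

// tenantServer is a hypothetical stand-in for the tenant server object; the
// real startup code in the server package differs.
type tenantServer struct{ name string }

func (s *tenantServer) gracefulDrain() { fmt.Println("draining", s.name) }

func main() {
	var (
		wg     sync.WaitGroup
		server *tenantServer
	)

	// Buggy pattern: the drain-propagation task captures `server` before it
	// is assigned; if a drain is requested while startup is still in flight
	// or has failed, the task dereferences a nil pointer.
	//
	//   go func() { <-drainRequested; server.gracefulDrain() }()
	//   server = newTenantServer()

	// Fixed pattern: finish initialization first, then start the task that
	// propagates graceful drains (but before accepting clients, so there is
	// no window where clients connect and drains are not propagated).
	server = &tenantServer{name: "tenant-10"}
	drainRequested := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		<-drainRequested
		server.gracefulDrain() // server is guaranteed to be non-nil here
	}()

	close(drainRequested)
	wg.Wait()
}
```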
Describe the problem
While investigating #100989 I created the following scenario: a hard (ungraceful) stop catches up with a graceful drain while the SQL stats machinery started by `(*PersistedSQLStats).Start` (in `sql/sqlstats/persistedsqlstats/provider.go`) is still running. And this panics: the byte monitor's leak assertion fires at shutdown because not all allocated bytes were released.
This bug is relatively hard to trigger because in most cases a server is shut down either only using a graceful drain, or only using a hard stop, but not both. `TestServerStartStop` happens to do both, and this is where the problem was caught.

I was able to work around this by changing `mon.Stop` to `mon.EmergencyStop` in that specific case. This papers over the issue and prevents the panic from occurring. However, this is not a good long-term fix, because it disables the leak detection code, which we still would like to use.

To Reproduce

Take the code from #101296. Remove the last commit, which replaces `mon.Stop` by `mon.EmergencyStop` in `(*PersistedSQLStats).Start` in `sql/sqlstats/persistedsqlstats/provider.go`. Then stress test `TestServerStartStop`. Recommended flags:

Expected behavior
The abrupt termination of ongoing SQL stats queries during shutdown should still result in an orderly cleanup of the memory monitors.
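One way to read that expectation, as a minimal sketch: each query's memory account is released on the way out even when a hard stop cuts it short, so the monitor's leak assertion can stay enabled. The `monitor` and `account` types below are hypothetical, not the actual `mon.BytesMonitor`/`BoundAccount` API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// monitor and account are hypothetical stand-ins for a byte monitor and a
// memory account bound to it.
type monitor struct {
	mu        sync.Mutex
	allocated int64
}

type account struct {
	mon  *monitor
	used int64
}

func (m *monitor) open() *account { return &account{mon: m} }

func (a *account) grow(n int64) {
	a.mon.mu.Lock()
	defer a.mon.mu.Unlock()
	a.mon.allocated += n
	a.used += n
}

// close releases everything the account holds, even on abrupt cancellation.
func (a *account) close() {
	a.mon.mu.Lock()
	defer a.mon.mu.Unlock()
	a.mon.allocated -= a.used
	a.used = 0
}

// stop keeps the leak assertion: it holds because every worker closes its
// account on the way out, graceful or not.
func (m *monitor) stop() {
	if m.allocated != 0 {
		panic(fmt.Sprintf("%d bytes leaked", m.allocated))
	}
}

func main() {
	m := &monitor{}
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		acc := m.open()
		defer acc.close() // released even when the hard stop cancels us
		acc.grow(1024)
		<-ctx.Done() // simulate the stats query being cut short
	}()
	cancel() // hard stop catches up with the drain
	wg.Wait()
	m.stop() // assertion holds: no panic, and leak detection stays on
}
```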
Jira issue: CRDB-26896