-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: flakes caused by leaked goroutine diskHealthCheckingFile
in tests
#106119
Comments
Typically this is caused by some higher-level code using the Engine’s filesystem after the engine has been closed, causing these goroutines to be recreated. Maybe we should detect this within the Engine and error with an explanatory error message. |
In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None
@cockroachdb/sql-queries — In #106195 I added new tracking of filesystems operations on closed Engines and files that remain open after Engine close to help surface these issues. It looks like there's a handful of disk-spilling tests that fail with leaked files. Do you mind taking a look? |
I updated the description to mention the tests caught by #106195. |
diskHealthCheckingFile
in testsdiskHealthCheckingFile
in tests
Fixes: cockroachdb#106119 Release note: None
Refactor `colexectestutils.RunTests` and variants to get the full set of `colexecop.Closers` from the constructor and to close all of these after running tests, instead of only closing the head operator. Fixes: cockroachdb#106119 Release note: None
In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None
106459: colexec: close operators at the end of colexec tests r=yuzefovich,michae2 a=michae2 Fixes: #106119 **colcontainer: close PDQ during TestPartitionedDiskQueueSimulatedExternal** Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`. Release note: None --- **colexecutils: close SpillingBuffer during TestSpillingBuffer** Release note: None --- **colexec: make some operators implement colexecop.Closer interface** This commit makes the following adjustments that fix "leaked files" in the disk-spilling tests: - `orderedDistinct` now implements `colexecop.Closer` (needed for `TestExternalDistinct`) - `orderedAggregator` and `distinctChainOps` now implement `colexecop.Closer` (needed for `TestExternalHashAggregator`) - `TestCrossJoiner` is adjusted to explicitly close all closers when non-empty ON expression is used (in this case we have selection and projection operators on top of the cross joiner which currently don't implement the `Closer` interface). Release note: None Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None
In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None
In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None
106195: storage: ensure no filesystem use after close in test builds r=jbowens a=jbowens In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs #106119. Resolves #110293. Release note: None 110576: kvserver: add kill switch for scheduled rangefeed processor r=erikgrinaker a=aliher1911 This commit adds env var `COCKROACH_RANGEFEED_DISABLE_SCHEDULER` which is preventing creation of scheduler based rangefeed processors forcing replicas to always create legacy processors. This feature allows nodes to be recovered if scheduler based processor failes to process requests. Epic: none Fixes: #110559 Release note: None 110625: sql: fix cleanup of not exhausted pausable portals in some cases r=yuzefovich a=yuzefovich **execinfra: relax RowSource.ConsumerClosed contract** This commit adjusts RowSource.ConsumerClosed contract so that it's possible for this method to be called multiple times. Previously, vast majority of implementations already supported that (and some actually assumed that), but there were a couple that would result in a panic or an error if called multiple times - those have been relaxed. This is needed for the follow-up commit which will introduce an unconditional call of this method on the "head" processor of the flow (needed for pausable portals execution model). Release note: None **sql: fix cleanup of not exhausted pausable portals in some cases** Previously, if we executed a flow (either vectorized or row-based) that contains a row-by-row processor in "pausable portal" model and didn't fully exhaust that flow (meaning that it could still produce more rows), this could result in an error when closing the portal because that processor might not be closed properly. Vectorized operators aren't affected by this, but row-by-row processors effectively require `ConsumerClosed` method to be called on them, and this might not happen. In particular, it was assumed that this method would get called by `execinfra.Run` loop, but for pausable portal model we needed to adjust it so that the loop could be exited (and then re-entered) when switching to another portal; thus, we lost the guarantee that `ConsumerClosed` is always called. This commit fixes this problem by teaching both vectorized and row-based flows to always call `ConsumerClosed` on their "head" processor on `Cleanup` (which is called when closing the portal). Now, `Cleanup` calls `ConsumerClosed` unconditionally (when the flow _might_ be running in the pausable portal mode), and this is safe given the interface adjustment in the previous commit. Fixes: #110486 Release note (bug fix): Previously, when evaluating some queries (most likely with a lookup join) via "multiple active portals" execution mode (preview feature that can be enabled via `multiple_active_portals_enabled` session variable) CockroachDB could encounter an internal error like `unexpected 40960 leftover bytes` if that portal wasn't fully consumed. This is now fixed. Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`. Fixes: #106119 Release note: None
Fixes: #106119 Release note: None
Found e.g. in https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/10777271?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true
The routine needs either to shut down cleanly, or be declared as a possible leak in package
leaktest
.This typically happens when an engine isn't closed during test tear-down. The following tests have been identified (via #106195):
pkg/sql/colexec.TestExternalDistinct
pkg/sql/colexec.TestSpillingBuffer
pkg/sql/colcontainer/colcontainer_test.TestPartitionedDiskQueueSimulatedExternal
Jira issue: CRDB-29411
The text was updated successfully, but these errors were encountered: