sql: flakes caused by leaked goroutine `diskHealthCheckingFile` in tests #106119

knz · 2023-07-04T16:29:26Z

Found e.g. in https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/10777271?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

    explain_test.go:559: Leaked goroutine: goroutine 45013 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
          github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:172 +0xdc
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker
          github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:167 +0x5d

The routine needs either to shut down cleanly, or be declared as a possible leak in package leaktest.

This typically happens when an engine isn't closed during test tear-down. The following tests have been identified (via #106195):

pkg/sql/colexec.TestExternalDistinct
pkg/sql/colexec.TestSpillingBuffer
pkg/sql/colcontainer/colcontainer_test.TestPartitionedDiskQueueSimulatedExternal

Jira issue: CRDB-29411

The text was updated successfully, but these errors were encountered:

jbowens · 2023-07-04T18:22:05Z

Typically this is caused by some higher-level code using the Engine’s filesystem after the engine has been closed, causing these goroutines to be recreated. Maybe we should detect this within the Engine and error with an explanatory error message.

In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None

jbowens · 2023-07-06T16:02:37Z

@cockroachdb/sql-queries — In #106195 I added new tracking of filesystems operations on closed Engines and files that remain open after Engine close to help surface these issues. It looks like there's a handful of disk-spilling tests that fail with leaked files. Do you mind taking a look?

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/10796158?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

nicktrav · 2023-07-07T21:36:06Z

I updated the description to mention the tests caught by #106195.

Fixes: cockroachdb#106119 Release note: None

Refactor `colexectestutils.RunTests` and variants to get the full set of `colexecop.Closers` from the constructor and to close all of these after running tests, instead of only closing the head operator. Fixes: cockroachdb#106119 Release note: None

In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None

106459: colexec: close operators at the end of colexec tests r=yuzefovich,michae2 a=michae2 Fixes: #106119 **colcontainer: close PDQ during TestPartitionedDiskQueueSimulatedExternal** Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`. Release note: None --- **colexecutils: close SpillingBuffer during TestSpillingBuffer** Release note: None --- **colexec: make some operators implement colexecop.Closer interface** This commit makes the following adjustments that fix "leaked files" in the disk-spilling tests: - `orderedDistinct` now implements `colexecop.Closer` (needed for `TestExternalDistinct`) - `orderedAggregator` and `distinctChainOps` now implement `colexecop.Closer` (needed for `TestExternalHashAggregator`) - `TestCrossJoiner` is adjusted to explicitly close all closers when non-empty ON expression is used (in this case we have selection and projection operators on top of the cross joiner which currently don't implement the `Closer` interface). Release note: None Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>

In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs cockroachdb#106119. Release note: None

106195: storage: ensure no filesystem use after close in test builds r=jbowens a=jbowens In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed. Epic: none Informs #106119. Resolves #110293. Release note: None 110576: kvserver: add kill switch for scheduled rangefeed processor r=erikgrinaker a=aliher1911 This commit adds env var `COCKROACH_RANGEFEED_DISABLE_SCHEDULER` which is preventing creation of scheduler based rangefeed processors forcing replicas to always create legacy processors. This feature allows nodes to be recovered if scheduler based processor failes to process requests. Epic: none Fixes: #110559 Release note: None 110625: sql: fix cleanup of not exhausted pausable portals in some cases r=yuzefovich a=yuzefovich **execinfra: relax RowSource.ConsumerClosed contract** This commit adjusts RowSource.ConsumerClosed contract so that it's possible for this method to be called multiple times. Previously, vast majority of implementations already supported that (and some actually assumed that), but there were a couple that would result in a panic or an error if called multiple times - those have been relaxed. This is needed for the follow-up commit which will introduce an unconditional call of this method on the "head" processor of the flow (needed for pausable portals execution model). Release note: None **sql: fix cleanup of not exhausted pausable portals in some cases** Previously, if we executed a flow (either vectorized or row-based) that contains a row-by-row processor in "pausable portal" model and didn't fully exhaust that flow (meaning that it could still produce more rows), this could result in an error when closing the portal because that processor might not be closed properly. Vectorized operators aren't affected by this, but row-by-row processors effectively require `ConsumerClosed` method to be called on them, and this might not happen. In particular, it was assumed that this method would get called by `execinfra.Run` loop, but for pausable portal model we needed to adjust it so that the loop could be exited (and then re-entered) when switching to another portal; thus, we lost the guarantee that `ConsumerClosed` is always called. This commit fixes this problem by teaching both vectorized and row-based flows to always call `ConsumerClosed` on their "head" processor on `Cleanup` (which is called when closing the portal). Now, `Cleanup` calls `ConsumerClosed` unconditionally (when the flow _might_ be running in the pausable portal mode), and this is safe given the interface adjustment in the previous commit. Fixes: #110486 Release note (bug fix): Previously, when evaluating some queries (most likely with a lookup join) via "multiple active portals" execution mode (preview feature that can be enabled via `multiple_active_portals_enabled` session variable) CockroachDB could encounter an internal error like `unexpected 40960 leftover bytes` if that portal wasn't fully consumed. This is now fixed. Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>

Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`. Fixes: #106119 Release note: None

Fixes: #106119 Release note: None

knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Jul 4, 2023

jbowens mentioned this issue Jul 5, 2023

storage: ensure no filesystem use after close in test builds #106195

Merged

blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 6, 2023

michae2 self-assigned this Jul 6, 2023

nicktrav added the branch-master Failures and bugs on the master branch. label Jul 7, 2023

michae2 mentioned this issue Jul 7, 2023

colexec: close disk resources in tests #106459

Merged

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Active in SQL Queries Jul 24, 2023

jbowens changed the title ~~storage: flakes caused by leaked goroutine diskHealthCheckingFile in tests~~ sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests Jul 28, 2023

jbowens removed the T-storage Storage Team label Jul 28, 2023

michae2 added the A-sql-vec SQL vectorized engine label Jul 28, 2023

michae2 mentioned this issue Aug 21, 2023

sql: TestExplainRedact failed #109115

Closed

yuzefovich pushed a commit to michae2/cockroach that referenced this issue Sep 8, 2023

colexecutils: close SpillingBuffer during TestSpillingBuffer

ef09966

Fixes: cockroachdb#106119 Release note: None

craig bot closed this as completed in 73487e1 Sep 9, 2023

github-project-automation bot moved this from Active to Done in SQL Queries Sep 9, 2023

yuzefovich mentioned this issue Sep 19, 2023

fakedist-disk_test: TestCCLLogic_partitioning_enum failed -- leaked goroutine #110797

Closed

blathers-crl bot pushed a commit that referenced this issue Oct 13, 2023

colcontainer: close PDQ during TestPartitionedDiskQueueSimulatedExternal

7d91b44

Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`. Fixes: #106119 Release note: None

blathers-crl bot pushed a commit that referenced this issue Oct 13, 2023

colexecutils: close SpillingBuffer during TestSpillingBuffer

ce625c5

Fixes: #106119 Release note: None

blathers-crl bot mentioned this issue Oct 13, 2023

release-23.1: colexec: close disk resources in tests #112341

Closed

jbowens added this to [Deprecated] Storage Jun 4, 2024

jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: flakes caused by leaked goroutine `diskHealthCheckingFile` in tests #106119

sql: flakes caused by leaked goroutine `diskHealthCheckingFile` in tests #106119

knz commented Jul 4, 2023 •

edited by yuzefovich

Loading

jbowens commented Jul 4, 2023

jbowens commented Jul 6, 2023 •

edited

Loading

nicktrav commented Jul 7, 2023

sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests #106119

sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests #106119

Comments

knz commented Jul 4, 2023 • edited by yuzefovich Loading

jbowens commented Jul 4, 2023

jbowens commented Jul 6, 2023 • edited Loading

nicktrav commented Jul 7, 2023

sql: flakes caused by leaked goroutine `diskHealthCheckingFile` in tests #106119

sql: flakes caused by leaked goroutine `diskHealthCheckingFile` in tests #106119

knz commented Jul 4, 2023 •

edited by yuzefovich

Loading

jbowens commented Jul 6, 2023 •

edited

Loading