Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests #106119

Closed
3 tasks done
knz opened this issue Jul 4, 2023 · 3 comments · Fixed by #106459
Closed
3 tasks done

sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests #106119

knz opened this issue Jul 4, 2023 · 3 comments · Fixed by #106459
Assignees
Labels
A-sql-vec SQL vectorized engine A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-sql-queries SQL Queries Team

Comments

@knz
Copy link
Contributor

knz commented Jul 4, 2023

Found e.g. in https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/10777271?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

    explain_test.go:559: Leaked goroutine: goroutine 45013 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
          github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:172 +0xdc
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker
          github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:167 +0x5d

The routine needs either to shut down cleanly, or be declared as a possible leak in package leaktest.

This typically happens when an engine isn't closed during test tear-down. The following tests have been identified (via #106195):

  • pkg/sql/colexec.TestExternalDistinct
  • pkg/sql/colexec.TestSpillingBuffer
  • pkg/sql/colcontainer/colcontainer_test.TestPartitionedDiskQueueSimulatedExternal

Jira issue: CRDB-29411

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Jul 4, 2023
@jbowens
Copy link
Collaborator

jbowens commented Jul 4, 2023

Typically this is caused by some higher-level code using the Engine’s filesystem after the engine has been closed, causing these goroutines to be recreated. Maybe we should detect this within the Engine and error with an explanatory error message.

jbowens added a commit to jbowens/cockroach that referenced this issue Jul 5, 2023
In test builds, panic if any of the VFS filesystem facilities are used after
the Engine is closed.

Epic: none
Informs cockroachdb#106119.
Release note: None
@jbowens
Copy link
Collaborator

jbowens commented Jul 6, 2023

@cockroachdb/sql-queries — In #106195 I added new tracking of filesystems operations on closed Engines and files that remain open after Engine close to help surface these issues. It looks like there's a handful of disk-spilling tests that fail with leaked files. Do you mind taking a look?

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/10796158?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 6, 2023
@michae2 michae2 self-assigned this Jul 6, 2023
@nicktrav
Copy link
Collaborator

nicktrav commented Jul 7, 2023

I updated the description to mention the tests caught by #106195.

@nicktrav nicktrav added the branch-master Failures and bugs on the master branch. label Jul 7, 2023
@mgartner mgartner moved this to Active in SQL Queries Jul 24, 2023
@jbowens jbowens changed the title storage: flakes caused by leaked goroutine diskHealthCheckingFile in tests sql: flakes caused by leaked goroutine diskHealthCheckingFile in tests Jul 28, 2023
@jbowens jbowens removed the T-storage Storage Team label Jul 28, 2023
@michae2 michae2 added the A-sql-vec SQL vectorized engine label Jul 28, 2023
yuzefovich pushed a commit to michae2/cockroach that referenced this issue Sep 8, 2023
yuzefovich pushed a commit to michae2/cockroach that referenced this issue Sep 8, 2023
Refactor `colexectestutils.RunTests` and variants to get the full set of
`colexecop.Closers` from the constructor and to close all of these after
running tests, instead of only closing the head operator.

Fixes: cockroachdb#106119

Release note: None
yuzefovich pushed a commit to michae2/cockroach that referenced this issue Sep 8, 2023
In test builds, panic if any of the VFS filesystem facilities are used after
the Engine is closed.

Epic: none
Informs cockroachdb#106119.
Release note: None
craig bot pushed a commit that referenced this issue Sep 8, 2023
106459: colexec: close operators at the end of colexec tests r=yuzefovich,michae2 a=michae2

Fixes: #106119

**colcontainer: close PDQ during TestPartitionedDiskQueueSimulatedExternal**

Close the `PartitionedDiskQueue` at the end of `TestPartitionedDiskQueueSimulatedExternal/HashJoin`.

Release note: None

---

**colexecutils: close SpillingBuffer during TestSpillingBuffer**

Release note: None

---

**colexec: make some operators implement colexecop.Closer interface**

This commit makes the following adjustments that fix "leaked files" in
the disk-spilling tests:
- `orderedDistinct` now implements `colexecop.Closer` (needed for
`TestExternalDistinct`)
- `orderedAggregator` and `distinctChainOps` now implement
`colexecop.Closer` (needed for `TestExternalHashAggregator`)
- `TestCrossJoiner` is adjusted to explicitly close all closers when
non-empty ON expression is used (in this case we have selection and
projection operators on top of the cross joiner which currently don't
implement the `Closer` interface).

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 73487e1 Sep 9, 2023
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Sep 9, 2023
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 12, 2023
In test builds, panic if any of the VFS filesystem facilities are used after
the Engine is closed.

Epic: none
Informs cockroachdb#106119.
Release note: None
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 14, 2023
In test builds, panic if any of the VFS filesystem facilities are used after
the Engine is closed.

Epic: none
Informs cockroachdb#106119.
Release note: None
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 14, 2023
In test builds, panic if any of the VFS filesystem facilities are used after
the Engine is closed.

Epic: none
Informs cockroachdb#106119.
Release note: None
craig bot pushed a commit that referenced this issue Sep 14, 2023
106195: storage: ensure no filesystem use after close in test builds r=jbowens a=jbowens

In test builds, panic if any of the VFS filesystem facilities are used after the Engine is closed.

Epic: none
Informs #106119.
Resolves #110293.
Release note: None

110576: kvserver: add kill switch for scheduled rangefeed processor r=erikgrinaker a=aliher1911

This commit adds env var `COCKROACH_RANGEFEED_DISABLE_SCHEDULER` which is preventing creation of scheduler based rangefeed processors forcing replicas to always create legacy processors.
This feature allows nodes to be recovered if scheduler based processor failes to process requests.

Epic: none
Fixes: #110559

Release note: None

110625: sql: fix cleanup of not exhausted pausable portals in some cases r=yuzefovich a=yuzefovich

**execinfra: relax RowSource.ConsumerClosed contract**

This commit adjusts RowSource.ConsumerClosed contract so that it's
possible for this method to be called multiple times. Previously, vast
majority of implementations already supported that (and some actually
assumed that), but there were a couple that would result in a panic or
an error if called multiple times - those have been relaxed.

This is needed for the follow-up commit which will introduce an
unconditional call of this method on the "head" processor of the flow
(needed for pausable portals execution model).

Release note: None

**sql: fix cleanup of not exhausted pausable portals in some cases**

Previously, if we executed a flow (either vectorized or row-based) that
contains a row-by-row processor in "pausable portal" model and didn't
fully exhaust that flow (meaning that it could still produce more rows),
this could result in an error when closing the portal because that
processor might not be closed properly. Vectorized operators aren't
affected by this, but row-by-row processors effectively require
`ConsumerClosed` method to be called on them, and this might not happen.
In particular, it was assumed that this method would get called by
`execinfra.Run` loop, but for pausable portal model we needed to adjust
it so that the loop could be exited (and then re-entered) when switching
to another portal; thus, we lost the guarantee that `ConsumerClosed` is
always called.

This commit fixes this problem by teaching both vectorized and row-based
flows to always call `ConsumerClosed` on their "head" processor on
`Cleanup` (which is called when closing the portal). Now, `Cleanup`
calls `ConsumerClosed` unconditionally (when the flow _might_ be running
in the pausable portal mode), and this is safe given the interface
adjustment in the previous commit.

Fixes: #110486

Release note (bug fix): Previously, when evaluating some queries (most
likely with a lookup join) via "multiple active portals" execution
mode (preview feature that can be enabled via
`multiple_active_portals_enabled` session variable) CockroachDB could
encounter an internal error like `unexpected 40960 leftover bytes` if
that portal wasn't fully consumed. This is now fixed.

Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 13, 2023
Close the `PartitionedDiskQueue` at the end of
`TestPartitionedDiskQueueSimulatedExternal/HashJoin`.

Fixes: #106119

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 13, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-vec SQL vectorized engine A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-sql-queries SQL Queries Team
Projects
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants