
sql/execstats: leaked goroutine in TestTraceAnalyzer #92903

Closed
cockroach-teamcity opened this issue Dec 2, 2022 · 15 comments · Fixed by #93214


cockroach-teamcity commented Dec 2, 2022

sql/execstats.TestTraceAnalyzer failed with artifacts on master @ 7cb778506d75bbef2eb90abccaa75b9dc7e3fb91:

=== RUN   TestTraceAnalyzer
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer3808448316
    test_log_scope.go:79: use -show-logs to present logs inline
=== CONT  TestTraceAnalyzer
    traceanalyzer_test.go:190: Leaked goroutine: goroutine 3051 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked.func1()
        	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:302 +0xe5
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked
        	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x7a
    traceanalyzer_test.go:190: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer3808448316
--- FAIL: TestTraceAnalyzer (5.95s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-observability

This test on roachdash | Improve this report!

Jira issue: CRDB-22038

Epic CRDB-20293

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 2, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 2, 2022

xinhaoz commented Dec 5, 2022

Hi @nicktrav - the above looks like a Pebble issue. As storage on-call, can you confirm? If so, I can pass it along.

@nicktrav nicktrav added T-storage Storage Team and removed T-sql-observability labels Dec 5, 2022
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Dec 5, 2022
@nicktrav nicktrav assigned nicktrav and unassigned xinhaoz Dec 5, 2022

nicktrav commented Dec 5, 2022

This reproduces fairly quickly with the following:

$ ./dev test ./pkg/sql/execstats --filter TestTraceAnalyzer --stress
...
--- FAIL: TestTraceAnalyzer (6.23s)
    test_log_scope.go:161: test logs captured to: /home/nickt/Development/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer805895456
    test_log_scope.go:79: use -show-logs to present logs inline
    traceanalyzer_test.go:190: Leaked goroutine: goroutine 3231 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked.func1()
                github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:302 +0xe5
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked
                github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x7a
    traceanalyzer_test.go:190: -- test log scope end --


nicktrav commented Dec 6, 2022

I ran with a patch to panic on use of the FS after it has been closed and I see the following:

goroutine 2817 [running]:
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).checkClosed(0xc003a05c20?)
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:362 +0xdd
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).RemoveAll(0xc00038af00, {0xc003331230, 0x24})
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:494 +0x2c
github.com/cockroachdb/pebble/vfs.(*enospcFS).RemoveAll(0xc002636ba0, {0xc003331230, 0x24})
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:259 +0x4f
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).RemoveAll(0xc001d795e0?, {0xc003331230?, 0xc003dda720?})
        github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1716 +0x2d
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Cleanup(0xc001d795e0, {0x57fb380, 0xc003dda720})
        github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:403 +0x163
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.2()
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:106 +0x85
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.3()
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:114 +0x38
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:112 +0x2ae

Looks like there's some cleanup happening on the TempFS for the vectorized engine. The cleanup doesn't have a chance to complete, which is being picked up by the leak detector.

Cleanup happening in a goroutine here.

I think we can fix this by explicitly calling Close on the temp FS on cleanup, here. The vfs.FS interface makes this a little tricky, as it doesn't expose Close, but the implementation (diskHealthCheckingFS) does. Maybe we should move Close up onto the FS interface?
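As a rough sketch of that direction (a hypothetical helper, not the actual cleanup code), the type assertion could look like this, assuming the concrete FS's Close has the io.Closer signature:

import (
	"io"

	"github.com/cockroachdb/pebble/vfs"
)

// closeTempFS is a hypothetical helper: vfs.FS has no Close method, so the
// cleanup path would type-assert to io.Closer (which diskHealthCheckingFS is
// assumed to satisfy) until Close is lifted onto the FS interface itself.
func closeTempFS(fs vfs.FS) error {
	if c, ok := fs.(io.Closer); ok {
		return c.Close()
	}
	return nil // nothing to close for this FS implementation
}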


nicktrav commented Dec 7, 2022

Another option (if I'm understanding this correctly) would be to ensure that the DistSQLServer containing the TempFS is shut down gracefully (i.e. ensure that Cleanup has been called and completed on all flows).

Might have to pass this off to get some help with that.

@nicktrav nicktrav changed the title sql/execstats: TestTraceAnalyzer failed sql/execstats: leaked goroutine in TestTraceAnalyzer Dec 7, 2022
@yuzefovich

The cleanup doesn't have a chance to complete, which is being picked up by the leak detector.

Maybe we should change the code to not use the go statement directly and instead use the stopper? Then, IIUC, the test will have to wait until the cleanup is complete. I'll send a PR.
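A minimal sketch of that direction, assuming the pkg/util/stop API (Stopper.RunAsyncTask); the surrounding names and wiring are illustrative, not the actual flowinfra change:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// runFlowCleanup launches the flow's cleanup via the stopper instead of a bare
// go statement, so stopping the stopper waits for the cleanup to finish.
func runFlowCleanup(ctx context.Context, stopper *stop.Stopper, cleanup func(context.Context)) {
	if err := stopper.RunAsyncTask(ctx, "flow-cleanup", cleanup); err != nil {
		// The stopper is already quiescing; fall back to running synchronously.
		cleanup(ctx)
	}
}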

@yuzefovich

Hm, @nicktrav, is it ok if we proceed with the RemoveAll call even after the temp engine is closed? The reason is that the temp engine is added as a closer very early in the node lifecycle (in newSQLServer), whereas the flow cleanup task will be added later, and the closers are closed in FIFO order.

BTW what does this "leaked goroutine" failure actually mean? That we have some open files when the engine is closed? That we perform some calls on the engine after it was closed?


nicktrav commented Dec 7, 2022

is it ok if we proceed with RemoveAll call even after the temp engine is closed?

From what I gather, it's safe to do that (see here). However, the result is that the timing goroutine will not exit, and hence the "leak".

temp engine is added as the closer very early in the node lifecycle

I see. The comment I linked above mentions that it's safe to call Close again on the diskHealthCheckingFS. However, it's a little unfortunate here in that the closer is the pebbleTempEngine, which will error out if you close its underlying DB more than once.

Stepping back - I think there's a code structuring problem here - imo a vfs.FS would implement io.Closer, which would be idempotent. That feels intuitive. In that case you could just call Close on the temp FS once all the flows complete and have removed their files, etc., as part of cleanup.

Would that work for you here? i.e. call Close on the temp FS once cleanup is done?

what does this "leaked goroutine" failure actually mean?

In this case it's fairly benign - it just means that the goroutine that does the timing of FS operations for the temp FS does not exit. In prod, from what I can tell, the temp FS is a singleton, so there's not really any leak. It all gets torn down on exit.
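To make the io.Closer idea concrete, an idempotent Close could be sketched like this (an illustrative wrapper, not Pebble's actual diskHealthCheckingFS):

import (
	"io"
	"sync"

	"github.com/cockroachdb/pebble/vfs"
)

// closeOnceFS illustrates the idempotent-Close idea: Close may be called any
// number of times, but the wrapped FS is only ever closed once.
type closeOnceFS struct {
	vfs.FS
	once     sync.Once
	closeErr error
}

func (f *closeOnceFS) Close() error {
	f.once.Do(func() {
		if c, ok := f.FS.(io.Closer); ok {
			f.closeErr = c.Close()
		}
	})
	return f.closeErr
}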


nicktrav commented Dec 7, 2022

@yuzefovich - would adding a stop.CloserFn that blocks waiting for the cleanup tasks to run be an option? Added before this closer? That way the cleanup tasks do their RemoveAll calls, then the engine is closed (which will cause the goroutines to exit), and then the final cleanup of the temp dirs is done.
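A sketch of that shape, assuming the pkg/util/stop API (AddCloser, stop.CloserFn); the WaitGroup wiring is hypothetical:

import (
	"sync"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// flowCleanups is a hypothetical barrier: each flow cleanup task would call
// Add(1) when it is scheduled and Done() once its RemoveAll has returned.
var flowCleanups sync.WaitGroup

// registerCleanupBarrier adds a closer that blocks shutdown until all pending
// flow cleanups have finished; it has to be ordered so that it runs before the
// temp engine's closer, which is the ordering question discussed in this thread.
func registerCleanupBarrier(stopper *stop.Stopper) {
	stopper.AddCloser(stop.CloserFn(func() {
		flowCleanups.Wait()
	}))
}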

@yuzefovich

Stepping back - I think there's a code structuring problem here - imo a vfs.FS would implement io.Closer, which would be idempotent. That feels intuitive. In that case you could just call Close on the temp FS once all the flows complete and have removed their files, etc., as part of cleanup.

Would that work for you here? i.e. call Close on the temp FS once cleanup is done?

IIUC your suggestion, this wouldn't work. The setup we have is the following:

  • create a single temp FS on node startup
  • then we can serve an arbitrary number of flows throughout the node lifetime, and each of them might need some temp storage space
  • for each flow, as an additional precaution (to not leak any of the disk resources) we want to remove all files / folders that that particular flow created
  • on the node shutdown we close the FS.

Accumulating all flows throughout the node lifetime could lead to unnecessarily large memory usage, and deferring the removal of all the flows' files could lead to a disk usage leak while the node is up.

would adding a stop.CloserFn that blocks waiting for the cleanup tasks to run be an option? Added before this closer? That way the cleanup tasks do their RemoveAll calls, then the engine is closed (which will cause the goroutines to exit), and then the final cleanup of the temp dirs is done.

This seems like it could work. We could make RemoteFlowRunner keep track of the flow "cleanup" functions that haven't run yet, and on that new CloserFn we'd call those explicitly, before closing the FS.

One concern I have is that each "cleanup" function blocks until the corresponding flow exits - is that acceptable? The comments around Closers don't seem to say anything about the expectation for how long a Closer can take.
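Hypothetically, the bookkeeping in RemoteFlowRunner could look roughly like this (names and structure are illustrative only, not the actual code):

import "sync"

// flowCleanupRegistry illustrates tracking the not-yet-run cleanup functions by
// flow ID; drain is what the new CloserFn would invoke before the FS is closed.
type flowCleanupRegistry struct {
	mu       sync.Mutex
	cleanups map[string]func()
}

func newFlowCleanupRegistry() *flowCleanupRegistry {
	return &flowCleanupRegistry{cleanups: make(map[string]func())}
}

func (r *flowCleanupRegistry) register(id string, cleanup func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cleanups[id] = cleanup
}

func (r *flowCleanupRegistry) done(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cleanups, id)
}

func (r *flowCleanupRegistry) drain() {
	r.mu.Lock()
	pending := make([]func(), 0, len(r.cleanups))
	for _, c := range r.cleanups {
		pending = append(pending, c)
	}
	r.cleanups = map[string]func(){}
	r.mu.Unlock()
	for _, c := range pending {
		c() // each call may block until the corresponding flow exits
	}
}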


nicktrav commented Dec 7, 2022

... and on that new CloserFn we'd call those explicitly, before closing the FS.

One small refinement here - it's probably enough to just make this CloserFn wait for the cleanup tasks to return. No need to call Close on the FS (if we went down the route of adding the method to the interface). The subsequent Closer (for the temp engine) should handle closing the DB and the FS under the hood.

One concern I have is that each "cleanup" function blocks until the corresponding flow exits - is that acceptable? The comments around Closers don't seem to say anything about the expectation for how long a Closer can take.

I'm not sure about this either. I assume the closers run synchronously, so we'd be blocking shutdown. What about cancelling the flow (via its context)? Is that reasonable here? Or does it need to complete by itself?


yuzefovich commented Dec 7, 2022

I have a prototype in #93214, let's see what CI says.

I assume the closers run synchronously, so we'd be blocking shutdown. What about cancelling the flow (via its context)? Is that reasonable here? Or does it need to complete by itself?

In theory, the flows should already be canceled when the stopper is stopped (because we should start the "drain" process), but maybe that's not the case with an ungraceful shutdown or in tests. I'm currently chasing this down. We do have an option to cancel the flow if necessary, and it should just work - we still would need to block until all the goroutines of the flow actually exit, so there still could be some amount of blocking.
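In the abstract, the cancel-then-wait pattern described here is just the following (placeholder names, not the actual flowinfra API):

import (
	"context"
	"sync"
)

// cancelAndWait shows why canceling a flow still implies some blocking: the
// shutdown path must wait for the flow's goroutine to actually return.
func cancelAndWait(runFlow func(context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runFlow(ctx) // expected to return once it observes cancellation
	}()
	cancel()  // ask the flow to stop early
	wg.Wait() // still block until its goroutine has exited
}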

@yuzefovich

Here is an idea inspired by Raphael's comment, and maybe it's actually the same as / similar to what you mentioned as well.
What if we modify the temp engine to just delete all of the files on its closure? Does it actually already do that? Then, in the vectorized flow Cleanup function, we'd avoid the call to RemoveAll if the stopper is quiescing. This would make this flake extremely rare without making the production code more likely to "leak" disk resources due to temp storage not being cleaned up properly.
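A sketch of that check, assuming the stopper's ShouldQuiesce channel from pkg/util/stop; the function and parameter names are illustrative:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// maybeRemoveFlowTempDir skips the per-flow RemoveAll when the node is shutting
// down, relying on the temp engine's own closure to delete the files instead.
func maybeRemoveFlowTempDir(
	ctx context.Context, stopper *stop.Stopper, removeAll func(string) error, dir string,
) {
	select {
	case <-stopper.ShouldQuiesce():
		// Quiescing: don't race with the (possibly already closed) temp FS.
	default:
		if err := removeAll(dir); err != nil {
			log.Warningf(ctx, "unable to remove flow temp storage directory: %v", err)
		}
	}
}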


nicktrav commented Dec 7, 2022

I missed this previous comment:

In theory, the flows should already be canceled when the stopper is stopped (because we should start the "drain" process), but maybe that's not the case with an ungraceful shutdown or in tests. I'm currently chasing this down.

If that's the current intention, and we're seeing goroutines touching the FS after the engine has been closed, should we address that? It seems to be an issue in tests.

What if we modify the temp engine to just delete all of the files on its closure?

Our preference would be to avoid that. In Cockroach, a storage.Engine is also an fs.FS, and it follows that once the former is closed, it should be an error to interact with either the former or the latter (something we want to enforce eventually).

Would it be possible to instead fix the sequencing of shutdown events such that all the flows run their cleanup (i.e. RemoveAll, etc.), then the engine is closed, after which point nothing should be interacting with the engine?

Based on the comments in the thread you linked, I realize that may be difficult (the ordering of the stoppers is not guaranteed, etc.). Are there changes we can make to the stopper infrastructure / setup / teardown that could help make this possible? Or some other fix to ensure flows are cleaned up (and finish their RemoveAll calls) before closing the temp engine / fs?

@yuzefovich

Would it be possible to instead fix the sequencing of shutdown events such that all the flows run their cleanup (i.e. RemoveAll, etc.), then the engine is closed, after which point nothing should be interacting with the engine?

The easiest way I can think of is #93214.

We probably can adjust the stopper infrastructure to guarantee a particular ordering (it seems to be maintained in practice anyway).


nicktrav commented Dec 8, 2022

The easiest way I can think of is #93214.

SGTM. Thanks!

@craig craig bot closed this as completed in 1ee3989 Dec 13, 2022