
sql/execstats: leaked goroutine in TestTraceAnalyzer #92903

Closed
cockroach-teamcity opened this issue Dec 2, 2022 · 15 comments · Fixed by #93214


cockroach-teamcity commented Dec 2, 2022

sql/execstats.TestTraceAnalyzer failed with artifacts on master @ 7cb778506d75bbef2eb90abccaa75b9dc7e3fb91:

=== RUN   TestTraceAnalyzer
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer3808448316
    test_log_scope.go:79: use -show-logs to present logs inline
=== CONT  TestTraceAnalyzer
    traceanalyzer_test.go:190: Leaked goroutine: goroutine 3051 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked.func1()
        	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:302 +0xe5
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked
        	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x7a
    traceanalyzer_test.go:190: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer3808448316
--- FAIL: TestTraceAnalyzer (5.95s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-observability

This test on roachdash | Improve this report!

Jira issue: CRDB-22038

Epic CRDB-20293

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 2, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 2, 2022

xinhaoz commented Dec 5, 2022

Hi @nicktrav - the above looks like a Pebble issue. As storage on-call, can you confirm? If so, I can pass it along.

@nicktrav nicktrav added T-storage Storage Team and removed T-sql-observability labels Dec 5, 2022
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Dec 5, 2022
@nicktrav nicktrav assigned nicktrav and unassigned xinhaoz Dec 5, 2022

nicktrav commented Dec 5, 2022

This reproduces fairly quickly with the following:

$ ./dev test ./pkg/sql/execstats --filter TestTraceAnalyzer --stress
...
--- FAIL: TestTraceAnalyzer (6.23s)
    test_log_scope.go:161: test logs captured to: /home/nickt/Development/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer805895456
    test_log_scope.go:79: use -show-logs to present logs inline
    traceanalyzer_test.go:190: Leaked goroutine: goroutine 3231 [select]:
        github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked.func1()
                github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:302 +0xe5
        created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked
                github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x7a
    traceanalyzer_test.go:190: -- test log scope end --


nicktrav commented Dec 6, 2022

I ran with a patch to panic on use of the FS after it has been closed and I see the following:

goroutine 2817 [running]:
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).checkClosed(0xc003a05c20?)
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:362 +0xdd
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).RemoveAll(0xc00038af00, {0xc003331230, 0x24})
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:494 +0x2c
github.com/cockroachdb/pebble/vfs.(*enospcFS).RemoveAll(0xc002636ba0, {0xc003331230, 0x24})
        github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:259 +0x4f
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).RemoveAll(0xc001d795e0?, {0xc003331230?, 0xc003dda720?})
        github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1716 +0x2d
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Cleanup(0xc001d795e0, {0x57fb380, 0xc003dda720})
        github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:403 +0x163
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.2()
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:106 +0x85
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.3()
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:114 +0x38
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:112 +0x2ae

Looks like there's some cleanup happening on the TempFS for the vectorized engine. The cleanup doesn't have a chance to complete, which is being picked up by the leak detector.

Cleanup happening in a goroutine here.

I think we can fix this by explicitly calling Close on the temp FS on cleanup, here. The vfs.FS interface makes this a little tricky, as it doesn't expose Close, but the implementation (diskHealthCheckingFS) does. Maybe we should move Close up onto the FS interface?
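As a rough sketch of that direction (a hypothetical helper, not the actual cleanup code), the type assertion could look like this, assuming the concrete FS's Close has the io.Closer signature:

import (
	"io"

	"github.com/cockroachdb/pebble/vfs"
)

// closeTempFS is a hypothetical helper: vfs.FS has no Close method, so the
// cleanup path would type-assert to io.Closer (which diskHealthCheckingFS is
// assumed to satisfy) until Close is lifted onto the FS interface itself.
func closeTempFS(fs vfs.FS) error {
	if c, ok := fs.(io.Closer); ok {
		return c.Close()
	}
	return nil // nothing to close for this FS implementation
}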


nicktrav commented Dec 7, 2022

Another option (if I'm understanding this correctly) would be to ensure that the DistSQLServer containing the TempFS is shut down gracefully (i.e. ensure that Cleanup has been called and completed on all flows).

Might have to pass this off to get some help with that.

@nicktrav nicktrav changed the title sql/execstats: TestTraceAnalyzer failed sql/execstats: leaked goroutine in TestTraceAnalyzer Dec 7, 2022
@yuzefovich

The cleanup doesn't have a chance to complete, which is being picked up by the leak detector.

Maybe we should change the code to not use the go statement directly and instead use the stopper? Then, IIUC, the test will have to wait until the cleanup is complete. I'll send a PR.
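A minimal sketch of that direction, assuming the pkg/util/stop API (Stopper.RunAsyncTask); the surrounding names and wiring are illustrative, not the actual flowinfra change:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// runFlowCleanup launches the flow's cleanup via the stopper instead of a bare
// go statement, so stopping the stopper waits for the cleanup to finish.
func runFlowCleanup(ctx context.Context, stopper *stop.Stopper, cleanup func(context.Context)) {
	if err := stopper.RunAsyncTask(ctx, "flow-cleanup", cleanup); err != nil {
		// The stopper is already quiescing; fall back to running synchronously.
		cleanup(ctx)
	}
}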

@yuzefovich

Hm, @nicktrav, is it ok if we proceed with the RemoveAll call even after the temp engine is closed? The reason is that the temp engine is added as a closer very early in the node lifecycle (in newSQLServer), whereas the flow cleanup task will be added later, and the closers are closed in FIFO order.

BTW what does this "leaked goroutine" failure actually mean? That we have some open files when the engine is closed? That we perform some calls on the engine after it was closed?


nicktrav commented Dec 7, 2022

is it ok if we proceed with RemoveAll call even after the temp engine is closed?

From what I gather, it's safe to do that (see here). However, the result is that the timing goroutine will not exit, and hence the "leak".

temp engine is added as the closer very early in the node lifecycle

I see. The comment I linked above mentions that it's safe to call Close again on the diskHealthCheckingFS. However, it's a little unfortunate here in that the closer is the pebbleTempEngine, which will error out if you close its underlying DB more than once.

Stepping back - I think there's a code structuring problem here - imo a vfs.FS would implement io.Closer, which would be idempotent. That feels intuitive. In that case you could just call Close on the temp FS once all the flows complete and have removed their files, etc., as part of cleanup.

Would that work for you here? i.e. call Close on the temp FS once cleanup is done?

what does this "leaked goroutine" failure actually mean?

In this case it's fairly benign - it just means that the goroutine that does the timing of FS operations for the temp FS does not exit. In prod, from what I can tell, the temp FS is a singleton, so there's not really any leak. It all gets torn down on exit.
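To make the io.Closer idea concrete, an idempotent Close could be sketched like this (an illustrative wrapper, not Pebble's actual diskHealthCheckingFS):

import (
	"io"
	"sync"

	"github.com/cockroachdb/pebble/vfs"
)

// closeOnceFS illustrates the idempotent-Close idea: Close may be called any
// number of times, but the wrapped FS is only ever closed once.
type closeOnceFS struct {
	vfs.FS
	once     sync.Once
	closeErr error
}

func (f *closeOnceFS) Close() error {
	f.once.Do(func() {
		if c, ok := f.FS.(io.Closer); ok {
			f.closeErr = c.Close()
		}
	})
	return f.closeErr
}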


nicktrav commented Dec 7, 2022

@yuzefovich - would adding a stop.CloserFn that blocks waiting for the cleanup tasks to run be an option? Added before this closer? That way the cleanup tasks do their RemoveAll calls, then the engine is closed (which will cause the goroutines to exit), and then the final cleanup of the temp dirs is done.
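A sketch of that shape, assuming the pkg/util/stop API (AddCloser, stop.CloserFn); the WaitGroup wiring is hypothetical:

import (
	"sync"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// flowCleanups is a hypothetical barrier: each flow cleanup task would call
// Add(1) when it is scheduled and Done() once its RemoveAll has returned.
var flowCleanups sync.WaitGroup

// registerCleanupBarrier adds a closer that blocks shutdown until all pending
// flow cleanups have finished; it has to be ordered so that it runs before the
// temp engine's closer, which is the ordering question discussed in this thread.
func registerCleanupBarrier(stopper *stop.Stopper) {
	stopper.AddCloser(stop.CloserFn(func() {
		flowCleanups.Wait()
	}))
}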

@yuzefovich

Stepping back - I think there's a code structuring problem here - imo a vfs.FS would implement io.Closer, which would be idempotent. That feels intuitive. In that case you could just call Close on the temp FS once all the flows complete and have removed their files, etc., as part of cleanup.

Would that work for you here? i.e. call Close on the temp FS once cleanup is done?

IIUC your suggestion, this wouldn't work. The setup we have is the following:

  • create a single temp FS on node startup
  • then we can serve an arbitrary number of flows throughout the node lifetime, and each of them might need some temp storage space
  • for each flow, as an additional precaution (to not leak any of the disk resources) we want to remove all files / folders that that particular flow created
  • on the node shutdown we close the FS.

Accumulating all flows throughout the node lifetime could lead to unnecessarily large memory usage, and deferring the removal of all the flows' files could lead to a disk usage leak while the node is up.

would adding a stop.CloserFn that blocks waiting for the cleanup tasks to run be an option? Added before this closer? That way the cleanup tasks do their RemoveAll calls, then the engine is closed (which will cause the goroutines to exit), and then the final cleanup of the temp dirs is done.

This seems like it could work. We could make RemoteFlowRunner keep track of the flow "cleanup" functions that haven't run yet, and on that new CloserFn we'd call those explicitly, before closing the FS.

One concern I have is that each "cleanup" function blocks until the corresponding flow exits - is that acceptable? The comments around Closers don't seem to say anything about the expectation for how long a Closer can take.
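Hypothetically, the bookkeeping in RemoteFlowRunner could look roughly like this (names and structure are illustrative only, not the actual code):

import "sync"

// flowCleanupRegistry illustrates tracking the not-yet-run cleanup functions by
// flow ID; drain is what the new CloserFn would invoke before the FS is closed.
type flowCleanupRegistry struct {
	mu       sync.Mutex
	cleanups map[string]func()
}

func newFlowCleanupRegistry() *flowCleanupRegistry {
	return &flowCleanupRegistry{cleanups: make(map[string]func())}
}

func (r *flowCleanupRegistry) register(id string, cleanup func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cleanups[id] = cleanup
}

func (r *flowCleanupRegistry) done(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cleanups, id)
}

func (r *flowCleanupRegistry) drain() {
	r.mu.Lock()
	pending := make([]func(), 0, len(r.cleanups))
	for _, c := range r.cleanups {
		pending = append(pending, c)
	}
	r.cleanups = map[string]func(){}
	r.mu.Unlock()
	for _, c := range pending {
		c() // each call may block until the corresponding flow exits
	}
}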


nicktrav commented Dec 7, 2022

... and on that new CloserFn we'd call those explicitly, before closing the FS.

One small refinement here - it's probably enough to just make this CloserFn wait for the cleanup tasks to return. No need to call Close on the FS (if we went down the route of adding the method to the interface). The subsequent Closer (for the temp engine) should handle closing the DB and the FS under the hood.

One concern I have is that each "cleanup" function blocks until the corresponding flow exits - is that acceptable? The comments around Closers don't seem to say anything about the expectation for how long a Closer can take.

I'm not sure about this either. I assume the closers run synchronously, so we'd be blocking shutdown. What about cancelling the flow (via its context)? Is that reasonable here? Or does it need to complete by itself?


yuzefovich commented Dec 7, 2022

I have a prototype in #93214, let's see what CI says.

I assume the closers run synchronously, so we'd be blocking shutdown. What about cancelling the flow (via its context)? Is that reasonable here? Or does it need to complete by itself?

In theory, the flows should already be canceled when the stopper is stopped (because we should start the "drain" process), but maybe that's not the case with an ungraceful shutdown or in tests. I'm currently chasing this down. We do have an option to cancel the flow if necessary, and it should just work - we still would need to block until all the goroutines of the flow actually exit, so there still could be some amount of blocking.
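In the abstract, the cancel-then-wait pattern described here is just the following (placeholder names, not the actual flowinfra API):

import (
	"context"
	"sync"
)

// cancelAndWait shows why canceling a flow still implies some blocking: the
// shutdown path must wait for the flow's goroutine to actually return.
func cancelAndWait(runFlow func(context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runFlow(ctx) // expected to return once it observes cancellation
	}()
	cancel()  // ask the flow to stop early
	wg.Wait() // still block until its goroutine has exited
}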

@yuzefovich

Here is an idea inspired by Raphael's comment, and maybe it's actually the same as / similar to what you mentioned as well.
What if we modify the temp engine to just delete all of the files on its closure? Does it actually already do that? Then, in the vectorized flow Cleanup function, we'd avoid the call to RemoveAll if the stopper is quiescing. This would make this flake extremely rare without making the production code more likely to "leak" disk resources due to temp storage not being cleaned up properly.
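A sketch of that check, assuming the stopper's ShouldQuiesce channel from pkg/util/stop; the function and parameter names are illustrative:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// maybeRemoveFlowTempDir skips the per-flow RemoveAll when the node is shutting
// down, relying on the temp engine's own closure to delete the files instead.
func maybeRemoveFlowTempDir(
	ctx context.Context, stopper *stop.Stopper, removeAll func(string) error, dir string,
) {
	select {
	case <-stopper.ShouldQuiesce():
		// Quiescing: don't race with the (possibly already closed) temp FS.
	default:
		if err := removeAll(dir); err != nil {
			log.Warningf(ctx, "unable to remove flow temp storage directory: %v", err)
		}
	}
}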


nicktrav commented Dec 7, 2022

I missed this previous comment:

In theory, the flows should already be canceled when the stopper is stopped (because we should start the "drain" process), but maybe that's not the case with an ungraceful shutdown or in tests. I'm currently chasing this down.

If that's the current intention, and we're seeing goroutines touching the FS after the engine has been closed, should we address that? It seems to be an issue in tests.

What if we modify the temp engine to just delete all of the files on its closure?

Our preference would be to avoid that. In Cockroach, a storage.Engine is also an fs.FS, and it follows that once the former is closed, it should be an error to interact with either the former or the latter (something we want to enforce eventually).

Would it be possible to instead fix the sequencing of shutdown events such that all the flows run their cleanup (i.e. RemoveAll, etc.), then the engine is closed, after which point nothing should be interacting with the engine?

Based on the comments in the thread you linked, I realize that may be difficult (the ordering of the stoppers is not guaranteed, etc.). Are there changes we can make to the stopper infrastructure / setup / teardown that could help make this possible? Or some other fix to ensure flows are cleaned up (and finish their RemoveAll calls) before closing the temp engine / fs?

@yuzefovich

Would it be possible to instead fix the sequencing of shutdown events such that all the flows run their cleanup (i.e. RemoveAll, etc.), then the engine is closed, after which point nothing should be interacting with the engine?

The easiest way I can think of is #93214.

We probably can adjust the stopper infrastructure to guarantee a particular ordering (it seems to be maintained in practice anyway).


nicktrav commented Dec 8, 2022

The easiest way I can think of is #93214.

SGTM. Thanks!

@craig craig bot closed this as completed in 1ee3989 Dec 13, 2022