sql/execstats: leaked goroutine in TestTraceAnalyzer #92903
Hi, @nicktrav - the above looks like a Pebble issue. As storage on-call, can you confirm? If so, I can pass it along.
This reproduces fairly quickly with the following:

$ ./dev test ./pkg/sql/execstats --filter TestTraceAnalyzer --stress
...
--- FAIL: TestTraceAnalyzer (6.23s)
test_log_scope.go:161: test logs captured to: /home/nickt/Development/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/8694f6b1bc68c3282fab3cef094820f5/logTestTraceAnalyzer805895456
test_log_scope.go:79: use -show-logs to present logs inline
traceanalyzer_test.go:190: Leaked goroutine: goroutine 3231 [select]:
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked.func1()
github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:302 +0xe5
created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).startTickerLocked
github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:297 +0x7a
traceanalyzer_test.go:190: -- test log scope end --
I ran with a patch to panic on use of the FS after it has been closed, and I see the following:

goroutine 2817 [running]:
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).checkClosed(0xc003a05c20?)
github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:362 +0xdd
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).RemoveAll(0xc00038af00, {0xc003331230, 0x24})
github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:494 +0x2c
github.com/cockroachdb/pebble/vfs.(*enospcFS).RemoveAll(0xc002636ba0, {0xc003331230, 0x24})
github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:259 +0x4f
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).RemoveAll(0xc001d795e0?, {0xc003331230?, 0xc003dda720?})
github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1716 +0x2d
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Cleanup(0xc001d795e0, {0x57fb380, 0xc003dda720})
github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:403 +0x163
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.2()
github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:106 +0x85
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.3()
github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:114 +0x38
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1
github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:112 +0x2ae

Looks like there's some cleanup happening on the TempFS for the vectorized engine. The cleanup doesn't get a chance to complete, which is then picked up by the leak detector. The cleanup happens in a goroutine here. I think we can fix this by explicitly calling
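The "panic on use of the FS after it has been closed" debugging patch mentioned above can be sketched roughly like this (hypothetical type; the real checkClosed lives in pebble's disk_health.go):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// closableFS sketches the debugging patch: every FS operation checks a
// closed flag first and panics on use-after-close.
type closableFS struct {
	closed atomic.Bool
}

func (fs *closableFS) checkClosed() {
	if fs.closed.Load() {
		panic("filesystem used after Close")
	}
}

func (fs *closableFS) RemoveAll(path string) {
	fs.checkClosed()
	// a real FS would remove everything rooted at path here
}

func (fs *closableFS) Close() {
	fs.closed.Store(true)
}

func main() {
	fs := &closableFS{}
	fs.RemoveAll("tmp/flow-1") // fine: happens before Close
	fs.Close()
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("caught:", r)
		}
	}()
	fs.RemoveAll("tmp/flow-2") // the racy cleanup: panics under the patch
}
```

This is how the stack trace above was obtained: the late RemoveAll from the flow's cleanup goroutine trips the check.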
Another option (if I'm understanding this correctly) would be to ensure that the

Might have to pass this off to get some help with that.
Maybe we should change the code to not use
Hm, @nicktrav, is it ok if we proceed with

BTW, what does this "leaked goroutine" failure actually mean? That we have some open files when the engine is closed? That we perform some calls on the engine after it was closed?
From what I gather, it's safe to do that (see here). However, the result is that the timing goroutine will not exit, and hence the "leak".
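For illustration, the fix shape being discussed - an explicit Close that signals the timing goroutine to exit - might look like this (hypothetical, simplified type; not pebble's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthCheckingFS is a simplified, hypothetical stand-in for a wrapper
// like pebble's diskHealthCheckingFS: it runs a timing goroutine that
// only exits when Close is called.
type healthCheckingFS struct {
	stop chan struct{}
	done sync.WaitGroup
}

func newHealthCheckingFS() *healthCheckingFS {
	fs := &healthCheckingFS{stop: make(chan struct{})}
	fs.done.Add(1)
	go func() {
		defer fs.done.Done()
		t := time.NewTicker(time.Millisecond)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				// check the timing of in-flight operations here
			case <-fs.stop:
				return
			}
		}
	}()
	return fs
}

// Close signals the ticker goroutine and waits for it to exit. Without
// an explicit Close, the goroutine runs forever and reads as a "leak".
func (fs *healthCheckingFS) Close() {
	close(fs.stop)
	fs.done.Wait()
}

func main() {
	fs := newHealthCheckingFS()
	time.Sleep(5 * time.Millisecond)
	fs.Close()
	fmt.Println("timing goroutine exited cleanly")
}
```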
I see. The comment I linked above mentions that it's safe to call

Stepping back - I think there's a code structuring problem here - imo a

Would that work for you here? i.e. call
In this case it's fairly benign - it just means that the goroutine that does the timing of FS operations for the temp FS does not exit. In prod, from what I can tell, the temp FS is a singleton, so there's not really any leak. It all gets torn down on exit.
@yuzefovich - would adding a
IIUC your suggestion, this wouldn't work. The setup we have is the following:
Accumulating all flows throughout the node's lifetime could lead to unnecessarily large memory usage, and deferring the removal of all the flows' files could leak disk space while the node is up.
This seems like it could work. We could make

One concern I have is that each "cleanup" function blocks until the corresponding flow exits - is that acceptable? The comments around
One small refinement here - it's probably enough to just make this
I'm not sure about this either. I assume the closers run synchronously, so we'd be blocking shutdown. What about cancelling the flow (via its context)? Is that reasonable here? Or does it need to complete by itself?
I have a prototype in #93214; let's see what CI says.
In theory, the flows should already be canceled when the stopper is stopped (because we should start the "drain" process), but maybe that's not the case with an ungraceful shutdown or in tests. I'm currently chasing this down. We do have an option to cancel the flow if necessary, and it should just work - we would still need to block until all the goroutines of the flow actually exit, so there could still be some amount of blocking.
Here is an idea inspired by Raphael's comment, and maybe it's actually the same / similar to what you mentioned as well.
I missed this previous comment:
If that's the current intention, and we're seeing goroutines touching the FS after the engine has been closed, should we address that? It seems to be an issue in tests.
Our preference would be to avoid that. In Cockroach, a

Would it be possible to instead fix the sequencing of shutdown events such that all the flows run their cleanup (i.e. RemoveAll, etc.), then the engine is closed, after which point nothing should be interacting with the engine? Based on the comments in the thread you linked, I realize that may be difficult (the ordering of the stoppers is not guaranteed, etc.). Are there changes we can make to the stopper infrastructure / setup / teardown that could help make this possible? Or some other fix to ensure flows are cleaned up (and finish their
The easiest way I can think of is #93214. We can probably adjust the stopper infrastructure to guarantee a particular ordering (it seems to be maintained in practice anyway).
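If the stopper were changed to guarantee ordering, one common shape is running closers in LIFO registration order, so resources registered last (flows that use the temp engine) close before earlier ones (the engine itself). A toy sketch under that assumption (hypothetical stopper type, not cockroach's actual stop.Stopper):

```go
package main

import "fmt"

// stopper is a toy version of a stopper that runs registered closers
// in LIFO order.
type stopper struct {
	closers []func()
}

func (s *stopper) AddCloser(f func()) {
	s.closers = append(s.closers, f)
}

// Stop runs closers in reverse registration order.
func (s *stopper) Stop() {
	for i := len(s.closers) - 1; i >= 0; i-- {
		s.closers[i]()
	}
}

func main() {
	var order []string
	s := &stopper{}
	s.AddCloser(func() { order = append(order, "engine.Close") })
	s.AddCloser(func() { order = append(order, "flows.Cleanup") })
	s.Stop()
	fmt.Println(order) // flows clean up before the engine closes
}
```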
SGTM. Thanks!
sql/execstats.TestTraceAnalyzer failed with artifacts on master @ 7cb778506d75bbef2eb90abccaa75b9dc7e3fb91:
Parameters:
TAGS=bazel,gss
Help
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-22038
Epic CRDB-20293