-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: TestMVCCDeleteRangeConcurrentTxn failed #93849
Comments
This looks like more of an issue in the logging infrastructure, or a very rare flake that depends on goroutine scheduling. I did some digging into how the leak detection works. Most tests do At the end of the test, the defer'd leak check func runs and snapshots the current goroutines again. If there's a difference, and it's been 5 seconds (to allow for cleanup goroutines to exit, etc.), the test will error if there's a difference in the snapshots. That check is firing here, saying that we have a goroutine that was started after the initial leak check function call, that has yet to exit. The I would have thought that the goroutine would be running before we take the initial snapshot too, as the For what it's worth, I've been running a stress test and I have yet to see a failure.
I think this is just a flake, either in the logging package or in the leak test, rather than a bug in |
cc @cockroachdb/test-eng |
Yep, based on my reading, the
It appears to be a race. Note, there have been similar races [1] as of recently; potentially, it's related to the same change [2] although any other non-determinism could have influenced the scheduling order. The fact remains that these goroutines are created asynchronously from
Thanks for sending it over to TestEng. It is indeed a flake. Luckily, these appear to be fairly infrequent. E.g., looking at all the issues in the past 9 months, there are only two hits for
Note, we could add an exclusion for [1] #93464 (comment) cockroach/pkg/util/leaktest/leaktest.go Line 58 in 61f7c81
|
That sounds reasonable, given that the intention of that goroutine is that it's "always" running, and therefore should never "leak". I will put up a patch, at which point we can probably close this out. Thanks for digging into it! |
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them. These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them. Release note (backward-incompatible change): We've stopped supporting/populating the crdb_internal.gossip_network table. It was an internal table with no API guarantees (so perhaps no meriting a release note?). Release note (performance improvement): Significantly reduced CPU usage of the underlying gossip network in large clusters. --- Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done) This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change. I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time. `@irfansharif` `@kvoli` 93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd Part of #89949 Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly. Release note: None 93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d). A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots. This mode of failure is deemed to be a false-positive. The intention of the logging goroutines are that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start run, and hence show up in the goroutine snapshots, is non-deterministic. Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests. Closes #93849. Release note: None. Epic: CRDB-20293 Co-authored-by: Alex Robinson <[email protected]> Co-authored-by: Matthew Todd <[email protected]> Co-authored-by: Nick Travers <[email protected]>
The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d). A race condition was uncovered in cockroachdb#93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots. This mode of failure is deemed to be a false-positive. The intention of the logging goroutines are that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start run, and hence show up in the goroutine snapshots, is non-deterministic. Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests. Closes cockroachdb#93849. Release note: None.
storage.TestMVCCDeleteRangeConcurrentTxn failed with artifacts on master @ 7c8e7012b73a598f3aa1da9bbf422e833706f5d3:
Parameters:
TAGS=bazel,gss
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-22554
The text was updated successfully, but these errors were encountered: