storage: TestMVCCDeleteRangeConcurrentTxn failed #93849

Closed
cockroach-teamcity opened this issue Dec 17, 2022 · 4 comments · Fixed by #93989

cockroach-teamcity commented Dec 17, 2022

storage.TestMVCCDeleteRangeConcurrentTxn failed with artifacts on master @ 7c8e7012b73a598f3aa1da9bbf422e833706f5d3:

=== RUN   TestMVCCDeleteRangeConcurrentTxn
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/c5eb8fc8b8e683c19f3c3e4238f64094/logTestMVCCDeleteRangeConcurrentTxn4218633289
    test_log_scope.go:79: use -show-logs to present logs inline
    mvcc_test.go:1627: -- test log scope end --
    mvcc_test.go:1627: Leaked goroutine: goroutine 20 [chan receive]:
        github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
        	github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x5b
        created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
        	github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x25
--- FAIL: TestMVCCDeleteRangeConcurrentTxn (5.04s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-22554

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 17, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 17, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Dec 17, 2022
@nicktrav

This looks more like an issue in the logging infrastructure, or a very rare flake that depends on goroutine scheduling.

I did some digging into how the leak detection works. Most tests do defer leaktest.AfterTest(t)() at the very start of the test, which snapshots the goroutines that are running at the time that function is called. The snapshot should already include the logging goroutines, as they are spawned in an init func in pkg/util/log, here.

At the end of the test, the defer'd leak check func runs and snapshots the current goroutines again. If the snapshots still differ after 5 seconds (to allow cleanup goroutines time to exit, etc.), the test errors.

That check is firing here, saying that we have a goroutine that was started after the initial leak check function call and has yet to exit.
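
For context, here is a minimal standalone sketch of that before/after snapshot pattern. It is illustrative only, not the actual pkg/testutils/leaktest code; the helper names, filtering, deadline, and retry interval are all assumptions:

```go
// Minimal sketch of the before/after goroutine snapshot pattern.
package leaksketch

import (
	"runtime"
	"strings"
	"testing"
	"time"
)

// snapshot records the stacks of all goroutines, keyed by stack body.
func snapshot() map[string]bool {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = dump all goroutines
	stacks := make(map[string]bool)
	for _, g := range strings.Split(string(buf[:n]), "\n\n") {
		if strings.Contains(g, "leaksketch.snapshot") {
			continue // skip the goroutine taking this snapshot (the test itself)
		}
		// Drop the "goroutine N [state]:" header; wait durations in the
		// state can change between snapshots for the same goroutine.
		if i := strings.Index(g, "\n"); i >= 0 {
			g = g[i+1:]
		}
		stacks[g] = true
	}
	return stacks
}

// AfterTest is used as `defer AfterTest(t)()`: the outer call takes the
// "before" snapshot, and the returned closure re-snapshots at test end,
// retrying for up to 5 seconds before reporting anything new as a leak.
func AfterTest(t *testing.T) func() {
	before := snapshot()
	return func() {
		deadline := time.Now().Add(5 * time.Second)
		for {
			var leaked []string
			for g := range snapshot() {
				if !before[g] {
					leaked = append(leaked, g)
				}
			}
			if len(leaked) == 0 {
				return
			}
			if time.Now().After(deadline) {
				t.Errorf("Leaked goroutine(s):\n%s", strings.Join(leaked, "\n\n"))
				return
			}
			time.Sleep(50 * time.Millisecond)
		}
	}
}
```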

The flushDaemon goroutine should always be running at the end of the test, as the goroutine does not exit (side note: I assume this is ok?).

I would have thought that the goroutine would be running before we take the initial snapshot too, as the init func in pkg/util/log should have already run at this point, booting up the goroutine. However, now I'm wondering if the runtime only scheduled the goroutine, and had yet to run it when the snapshot was taken?

For what it's worth, I've been running a stress test and I have yet to see a failure.

$ ./dev test ./pkg/storage --filter TestMVCCDeleteRangeConcurrentTxn --stress
...
543987 runs so far, 0 failures, over 45m50s

I think this is just a flake, either in the logging package or in the leak test, rather than a bug in pkg/storage. Going to send this over to Test Eng to see if there's anything that jumps out, either about the test, or the leak check infrastructure.

@nicktrav nicktrav added T-testeng TestEng Team and removed T-storage Storage Team labels Dec 19, 2022

blathers-crl bot commented Dec 19, 2022

cc @cockroachdb/test-eng


srosenberg commented Dec 20, 2022

The flushDaemon goroutine should always be running at the end of the test, as the goroutine does not exit (side note: I assume this is ok?).

Yep, based on my reading, the flushDaemon is intended to run until the process terminates. Incidentally, there is a bug wrt doSync–it's always false.
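
For illustration, the daemon in question follows the familiar tick-forever pattern. This is a hypothetical sketch only; the names, interval, and flush callback are assumptions, not the actual pkg/util/log code:

```go
package main

import (
	"log"
	"os"
	"time"
)

// flushDaemon is a hypothetical sketch of the pattern discussed above:
// started once during process init, it ticks for the lifetime of the
// process and periodically flushes buffered log output. The real
// pkg/util/log/log_flush.go differs in detail.
func flushDaemon(flush func()) {
	ticker := time.NewTicker(10 * time.Second) // interval is an assumption
	defer ticker.Stop()                        // unreachable in practice; the loop never exits
	for range ticker.C {
		flush()
	}
}

func main() {
	// Spawned the way pkg/util/log does from an init func.
	go flushDaemon(func() { _ = os.Stderr.Sync() })
	log.Print("process does its work while the daemon keeps flushing")
	time.Sleep(time.Second)
}
```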

I would have thought that the goroutine would be running before we take the initial snapshot too, as the init func in pkg/util/log should have already run at this point, booting up the goroutine. However, now I'm wondering if the runtime only scheduled the goroutine, and had yet to run it when the snapshot was taken?

It appears to be a race. Note, there have been similar races [1] recently; potentially, it's related to the same change [2], although any other non-determinism could have influenced the scheduling order. The fact remains that these goroutines are created asynchronously from init() without any guarantee that they will have started by the time init() returns.
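
The scheduling hazard can be reproduced in a few lines of standalone Go (purely illustrative, not CockroachDB code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var daemonRunning atomic.Bool

// init launches a long-lived goroutine, much like pkg/util/log does for
// flushDaemon. The go statement only creates the goroutine; nothing
// guarantees it has begun executing by the time init returns.
func init() {
	go func() {
		daemonRunning.Store(true) // first thing the daemon does once it actually runs
		select {}                 // then park forever, like a daemon goroutine
	}()
}

func main() {
	// init has returned, but whether the goroutine has started yet depends
	// entirely on the scheduler. A snapshot taken right now (as a leak
	// checker would at test start) may or may not observe it running.
	fmt.Println("daemon observed as running:", daemonRunning.Load())
}
```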

I think this is just a flake, either in the logging package or in the leak test, rather than a bug in pkg/storage. Going to send this over to Test Eng to see if there's anything that jumps out, either about the test, or the leak check infrastructure.

Thanks for sending it over to TestEng. It is indeed a flake. Luckily, these appear to be fairly infrequent. E.g., looking at all the issues in the past 9 months, there are only two hits for flushDaemon: one is the current issue, the other is [3], from two weeks ago.

gh issue list --search "Leaked goroutine" --state all --json number,title,body --jq '.[] | "\(.number)\n \(.body)"'

Note, we could add an exclusion for flushDaemon similar to how sentry-go worker goroutine is excluded from the snapshot [4].

[1] #93464 (comment)
[2] #92930
[3] #93034
[4]

strings.Contains(stack, "sentry-go.(*HTTPTransport).worker") ||
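
A rough sketch of what that exclusion could look like in the leak-check filter follows. Only the sentry-go line above is quoted from the existing code; the function name and the flushDaemon match string are assumptions:

```go
package leaksketch

import "strings"

// isExpectedLongRunning reports whether a goroutine stack belongs to a
// goroutine that intentionally lives for the whole process and should be
// ignored by the leak check. Hypothetical sketch; the flushDaemon match
// string is an assumption.
func isExpectedLongRunning(stack string) bool {
	return strings.Contains(stack, "sentry-go.(*HTTPTransport).worker") ||
		strings.Contains(stack, "log.flushDaemon")
}
```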

@nicktrav

Note, we could add an exclusion for flushDaemon similar to how sentry-go worker goroutine is excluded from the snapshot [4].

That sounds reasonable, given that the intention of that goroutine is that it's "always" running, and therefore should never "leak". I will put up a patch, at which point we can probably close this out.

Thanks for digging into it!

@nicktrav nicktrav self-assigned this Dec 20, 2022
craig bot pushed a commit that referenced this issue Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.


Release note (backward-incompatible change): We've stopped
supporting/populating the crdb_internal.gossip_network table. It was an
internal table with no API guarantees (so perhaps not meriting a release
note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/log` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed to be a false positive. The intention of the logging goroutines is that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
@craig craig bot closed this as completed in 682f14e Dec 21, 2022
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue Dec 22, 2022