storage: TestMVCCDeleteRangeConcurrentTxn failed #93849

Closed
cockroach-teamcity opened this issue Dec 17, 2022 · 4 comments · Fixed by #93989

cockroach-teamcity commented Dec 17, 2022

storage.TestMVCCDeleteRangeConcurrentTxn failed with artifacts on master @ 7c8e7012b73a598f3aa1da9bbf422e833706f5d3:

=== RUN   TestMVCCDeleteRangeConcurrentTxn
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/c5eb8fc8b8e683c19f3c3e4238f64094/logTestMVCCDeleteRangeConcurrentTxn4218633289
    test_log_scope.go:79: use -show-logs to present logs inline
    mvcc_test.go:1627: -- test log scope end --
    mvcc_test.go:1627: Leaked goroutine: goroutine 20 [chan receive]:
        github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
        	github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x5b
        created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
        	github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x25
--- FAIL: TestMVCCDeleteRangeConcurrentTxn (5.04s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-22554

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 17, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 17, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Dec 17, 2022
@nicktrav

This looks more like an issue in the logging infrastructure, or a very rare flake that depends on goroutine scheduling.

I did some digging into how the leak detection works. Most tests do defer leaktest.AfterTest(t)() at the very start of the test, which snapshots the goroutines that are running at the time that function is called. The snapshot should already include the logging goroutines, as they are spawned in an init func in pkg/util/log, here.

At the end of the test, the defer'd leak check func runs and snapshots the current goroutines again. If the snapshots still differ after 5 seconds (to allow cleanup goroutines time to exit, etc.), the test errors.

That check is firing here, saying that we have a goroutine that was started after the initial leak check function call and has yet to exit.
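
For context, here is a minimal standalone sketch of that before/after snapshot pattern. It is illustrative only, not the actual pkg/testutils/leaktest code; the helper names, filtering, deadline, and retry interval are all assumptions:

```go
// Minimal sketch of the before/after goroutine snapshot pattern.
package leaksketch

import (
	"runtime"
	"strings"
	"testing"
	"time"
)

// snapshot records the stacks of all goroutines, keyed by stack body.
func snapshot() map[string]bool {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = dump all goroutines
	stacks := make(map[string]bool)
	for _, g := range strings.Split(string(buf[:n]), "\n\n") {
		if strings.Contains(g, "leaksketch.snapshot") {
			continue // skip the goroutine taking this snapshot (the test itself)
		}
		// Drop the "goroutine N [state]:" header; wait durations in the
		// state can change between snapshots for the same goroutine.
		if i := strings.Index(g, "\n"); i >= 0 {
			g = g[i+1:]
		}
		stacks[g] = true
	}
	return stacks
}

// AfterTest is used as `defer AfterTest(t)()`: the outer call takes the
// "before" snapshot, and the returned closure re-snapshots at test end,
// retrying for up to 5 seconds before reporting anything new as a leak.
func AfterTest(t *testing.T) func() {
	before := snapshot()
	return func() {
		deadline := time.Now().Add(5 * time.Second)
		for {
			var leaked []string
			for g := range snapshot() {
				if !before[g] {
					leaked = append(leaked, g)
				}
			}
			if len(leaked) == 0 {
				return
			}
			if time.Now().After(deadline) {
				t.Errorf("Leaked goroutine(s):\n%s", strings.Join(leaked, "\n\n"))
				return
			}
			time.Sleep(50 * time.Millisecond)
		}
	}
}
```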

The flushDaemon goroutine should always be running at the end of the test, as the goroutine does not exit (side note: I assume this is ok?).

I would have thought that the goroutine would be running before we take the initial snapshot too, as the init func in pkg/util/log should have already run at this point, booting up the goroutine. However, now I'm wondering if the runtime only scheduled the goroutine, and had yet to run it when the snapshot was taken?

For what it's worth, I've been running a stress test and I have yet to see a failure.

$ ./dev test ./pkg/storage --filter TestMVCCDeleteRangeConcurrentTxn --stress
...
543987 runs so far, 0 failures, over 45m50s

I think this is just a flake, either in the logging package or in the leak test, rather than a bug in pkg/storage. Going to send this over to Test Eng to see if there's anything that jumps out, either about the test, or the leak check infrastructure.

@nicktrav nicktrav added T-testeng TestEng Team and removed T-storage Storage Team labels Dec 19, 2022

blathers-crl bot commented Dec 19, 2022

cc @cockroachdb/test-eng


srosenberg commented Dec 20, 2022

The flushDaemon goroutine should always be running at the end of the test, as the goroutine does not exit (side note: I assume this is ok?).

Yep, based on my reading, the flushDaemon is intended to run until the process terminates. Incidentally, there is a bug wrt doSync–it's always false.
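
For illustration, the daemon in question follows the familiar tick-forever pattern. This is a hypothetical sketch only; the names, interval, and flush callback are assumptions, not the actual pkg/util/log code:

```go
package main

import (
	"log"
	"os"
	"time"
)

// flushDaemon is a hypothetical sketch of the pattern discussed above:
// started once during process init, it ticks for the lifetime of the
// process and periodically flushes buffered log output. The real
// pkg/util/log/log_flush.go differs in detail.
func flushDaemon(flush func()) {
	ticker := time.NewTicker(10 * time.Second) // interval is an assumption
	defer ticker.Stop()                        // unreachable in practice; the loop never exits
	for range ticker.C {
		flush()
	}
}

func main() {
	// Spawned the way pkg/util/log does from an init func.
	go flushDaemon(func() { _ = os.Stderr.Sync() })
	log.Print("process does its work while the daemon keeps flushing")
	time.Sleep(time.Second)
}
```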

I would have thought that the goroutine would be running before we take the initial snapshot too, as the init func in pkg/util/log should have already run at this point, booting up the goroutine. However, now I'm wondering if the runtime only scheduled the goroutine, and had yet to run it when the snapshot was taken?

It appears to be a race. Note, there have been similar races [1] recently; potentially, it's related to the same change [2], although any other non-determinism could have influenced the scheduling order. The fact remains that these goroutines are created asynchronously from init() without any guarantee that they will have started by the time init() returns.
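
The scheduling hazard can be reproduced in a few lines of standalone Go (purely illustrative, not CockroachDB code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var daemonRunning atomic.Bool

// init launches a long-lived goroutine, much like pkg/util/log does for
// flushDaemon. The go statement only creates the goroutine; nothing
// guarantees it has begun executing by the time init returns.
func init() {
	go func() {
		daemonRunning.Store(true) // first thing the daemon does once it actually runs
		select {}                 // then park forever, like a daemon goroutine
	}()
}

func main() {
	// init has returned, but whether the goroutine has started yet depends
	// entirely on the scheduler. A snapshot taken right now (as a leak
	// checker would at test start) may or may not observe it running.
	fmt.Println("daemon observed as running:", daemonRunning.Load())
}
```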

I think this is just a flake, either in the logging package or in the leak test, rather than a bug in pkg/storage. Going to send this over to Test Eng to see if there's anything that jumps out, either about the test, or the leak check infrastructure.

Thanks for sending it over to TestEng. It is indeed a flake. Luckily, these appear to be fairly infrequent. E.g., looking at all the issues in the past 9 months, there are only two hits for flushDaemon: one is the current issue, the other is [3], from two weeks ago.

gh issue list --search "Leaked goroutine" --state all --json number,title,body --jq '.[] | "\(.number)\n \(.body)"'

Note, we could add an exclusion for flushDaemon similar to how sentry-go worker goroutine is excluded from the snapshot [4].

[1] #93464 (comment)
[2] #92930
[3] #93034
[4]

strings.Contains(stack, "sentry-go.(*HTTPTransport).worker") ||
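
A rough sketch of what that exclusion could look like in the leak-check filter follows. Only the sentry-go line above is quoted from the existing code; the function name and the flushDaemon match string are assumptions:

```go
package leaksketch

import "strings"

// isExpectedLongRunning reports whether a goroutine stack belongs to a
// goroutine that intentionally lives for the whole process and should be
// ignored by the leak check. Hypothetical sketch; the flushDaemon match
// string is an assumption.
func isExpectedLongRunning(stack string) bool {
	return strings.Contains(stack, "sentry-go.(*HTTPTransport).worker") ||
		strings.Contains(stack, "log.flushDaemon")
}
```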

@nicktrav

Note, we could add an exclusion for flushDaemon similar to how sentry-go worker goroutine is excluded from the snapshot [4].

That sounds reasonable, given that the intention of that goroutine is that it's "always" running, and therefore should never "leak". I will put up a patch, at which point we can probably close this out.

Thanks for digging into it!

@nicktrav nicktrav self-assigned this Dec 20, 2022
craig bot pushed a commit that referenced this issue Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.


Release note (backward-incompatible change): We've stopped
supporting/populating the crdb_internal.gossip_network table. It was an
internal table with no API guarantees (so perhaps not meriting a release
note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/log` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed to be a false positive. The intention of the logging goroutines is that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
@craig craig bot closed this as completed in 682f14e Dec 21, 2022
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue Dec 22, 2022