Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: panic in newPeer under stress #108499

Closed
yuzefovich opened this issue Aug 10, 2023 · 3 comments · Fixed by #108841
Closed

rpc: panic in newPeer under stress #108499

yuzefovich opened this issue Aug 10, 2023 · 3 comments · Fixed by #108841
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Aug 10, 2023

On d9d421e when running ./dev test pkg/sql -f=TestRelocateNonVoters --stress --race --cpus=4 -- --test_env=COCKROACH_TEST_TENANT=true on the gceworker I got the following panic:

* ERROR: a panic has occurred!
* child [5 127.0.0.1:44713 default] already exists 
* (1) attached stack trace
*   -- stack trace: 
*   | runtime.gopanic
*   | › GOROOT/src/runtime/panic.go:890
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace: 
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*childSet).add
*   | › github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/agg_metric.go:107
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggGauge).AddChild
*   | › github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/gauge.go:87
*   | github.com/cockroachdb/cockroach/pkg/rpc.(*Metrics).acquire
*   | › github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/metrics.go:189
*   | github.com/cockroachdb/cockroach/pkg/rpc.(*Context).newPeer
*   | › github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/peer.go:178
*   | github.com/cockroachdb/cockroach/pkg/rpc.(*Context).grpcDialNodeInternal
*   | › github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:2157
*   | github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDialNode
*   | › github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:2114
*   | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
*   | › github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:170
*   | github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial
*   | › github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:103
*   | github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run
*   | › github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:112
*   | github.com/cockroachdb/cockroach/pkg/sql.(*runnerCoordinator).init.func1
*   | › github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:152
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | › github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   | › src/runtime/asm_amd64.s:1594
* Wraps: (4) child [5 127.0.0.1:44713 default] already exists 
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError

Seems like this code was added recently in #99191.

Stack trace is here

Jira issue: CRDB-30494

@yuzefovich yuzefovich added C-test-failure Broken test (automatically or manually discovered). T-kv-replication labels Aug 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 10, 2023

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

This was supposed to have been fixed by #107697. Might be a different failure mode.

@erikgrinaker erikgrinaker added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 10, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. branch-master Failures and bugs on the master branch. labels Aug 10, 2023
craig bot pushed a commit that referenced this issue Aug 16, 2023
108841: rpc: fix missed gauge Unlink r=aliher1911 a=pavelkalinnikov

The `peerMetrics.ConnectionInactive` gauge is not unlinked from the aggregated metric. Correspondingly, when a connection with the same `peerKey` is re-created, the metric with this key is already registered, which causes a panic.

This PR introduces the missed metric `Unlink`. It also add a regression test that would have caught the bug.

Fixes #108499
Epic: CRDB-21710
Release note: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
@craig craig bot closed this as completed in 240d549 Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants