Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: panic in per-peer metrics #105335

Closed
tbg opened this issue Jun 22, 2023 · 5 comments · Fixed by #107697
Closed

rpc: panic in per-peer metrics #105335

tbg opened this issue Jun 22, 2023 · 5 comments · Fixed by #107697
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@tbg
Copy link
Member

tbg commented Jun 22, 2023

          Hm, it looks like we hit some other panics here too, when adding RPC metrics around here:

peerMetrics: rpcCtx.metrics.acquire(k),

@tbg made some recent changes here in #99191.

panic: child [3 10.142.1.101:26257 default] already exists [recovered]
        panic: child [3 10.142.1.101:26257 default] already exists

goroutine 1399700 [running]:
panic({0x5a2a400, 0xc017b2f7a0})
        GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc00d05b388 sp=0xc00d05b2c8 pc=0x49e43a
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).recover(0xe9a0aa?, {0x766e250, 0xc0146360f0})
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:230 +0x6a fp=0xc00d05b3d0 sp=0xc00d05b388 pc=0x11cba4a
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2.3()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:478 +0x2e fp=0xc00d05b3f8 sp=0xc00d05b3d0 pc=0x11cd00e
runtime.deferCallSave(0xc00d05b4c8, 0xc018dd5f90?)
        GOROOT/src/runtime/panic.go:796 +0x88 fp=0xc00d05b408 sp=0xc00d05b3f8 pc=0x49e028
runtime.runOpenDeferFrame(0xc00d93e1e0?, 0xc0009be140)
        GOROOT/src/runtime/panic.go:769 +0x1a5 fp=0xc00d05b450 sp=0xc00d05b408 pc=0x49de45
panic({0x5a2a400, 0xc017b2f7a0})
        GOROOT/src/runtime/panic.go:884 +0x212 fp=0xc00d05b510 sp=0xc00d05b450 pc=0x49e292
github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*childSet).add(0xc0098cecc0, {0x7663540, 0xc009adf860})
        github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/agg_metric.go:107 +0x1d7 fp=0xc00d05b5f8 sp=0xc00d05b510 pc=0x1a9da17
github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggGauge).AddChild(...)
        github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/gauge.go:87
github.com/cockroachdb/cockroach/pkg/rpc.(*Metrics).acquire(0xc0005b7d38, {{0xc000a7dec0?, 0x56869e0?}, 0x10e58f0?, 0xc0?})
        github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/metrics.go:189 +0x31f fp=0xc00d05b6e0 sp=0xc00d05b5f8 pc=0x1b0715f
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).newPeer(0xc0005b7c00, {{0xc000a7dec0?, 0xc000a7dec0?}, 0xb7?, 0x0?})
        github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/peer.go:178 +0x65 fp=0xc00d05b888 sp=0xc00d05b6e0 pc=0x1b079e5
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).grpcDialNodeInternal(0xc0005b7c00, {0xc000a7dec0, 0x12}, 0x3, 0x0)
        github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:2093 +0x453 fp=0xc00d05bb08 sp=0xc00d05b888 pc=0x1b03dd3
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDialNode(0xc0122f4c10?, {0xc000a7dec0?, 0xc0008e68c0?}, 0x3?, 0x0?)
        github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:2050 +0x12e fp=0xc00d05bb90 sp=0xc00d05bb08 pc=0x1b038ce
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial(0xc0038fb860, {0x766e250, 0xc009adf7a0}, 0x30?, {0x7641a50?, 0xc004902698}, 0x1, 0x0?)
        github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:170 +0xb1 fp=0xc00d05bc40 sp=0xc00d05bb90 pc=0x1b1afd1
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial(0xc0038fb860, {0x766e250, 0xc009adf7a0}, 0x5c4aba0?, 0x0?)
        github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:103 +0x11c fp=0xc00d05bcb8 sp=0xc00d05bc40 pc=0x1b1aa5c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func2({0x766e250, 0xc009adf7a0})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:925 +0x226 fp=0xc00d05be18 sp=0xc00d05bcb8 pc=0x3994426
runtime/pprof.Do({0x766e250?, 0xc0146360f0?}, {{0xc016fd0fe0?, 0xc0122f4eb8?, 0xc63326?}}, 0xc01a845960)
        GOROOT/src/runtime/pprof/runtime.go:40 +0xa3 fp=0xc00d05be88 sp=0xc00d05be18 pc=0xe9a003
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).startProcessNewQueue.func3({0x766e250, 0xc0146360f0})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:947 +0x1c8 fp=0xc00d05bf30 sp=0xc00d05be88 pc=0x39941a8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146 fp=0xc00d05bfe0 sp=0xc00d05bf30 pc=0x11ccec6
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00d05bfe8 sp=0xc00d05bfe0 pc=0x4d35c1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

Originally posted by @erikgrinaker in #105260 (comment)

Jira issue: CRDB-28987

@blathers-crl
Copy link

blathers-crl bot commented Jun 22, 2023

Hi @tbg, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Jun 22, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 22, 2023

cc @cockroachdb/replication

@tbg tbg self-assigned this Jun 22, 2023
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jun 22, 2023

It looks like it was trying to reach n3 on 2 different IP addresses, fwiw. It paniced when it was trying the second one.

E230621 06:48:43.531065 1399656 2@rpc/peer.go:611 ⋮ [T1,n2,rnode=3,raddr=‹10.142.1.36:26257›,class=system,rpc] 137415  failed connection attempt‹ (last connected 5m0.048s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.36:26257: connect: connection refused"› [code 14/Unavailable]
E230621 06:48:43.531218 1399657 2@rpc/peer.go:611 ⋮ [T1,n2,rnode=3,raddr=‹10.142.1.36:26257›,class=default,rpc] 137416  failed connection attempt‹ (last connected 5m0.049s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.36:26257: connect: connection refused"› [code 14/Unavailable]
E230621 06:48:43.646279 1011 2@rpc/peer.go:611 ⋮ [T1,n2,rnode=3,raddr=‹10.142.1.101:26257›,class=system,rpc] 137417  failed connection attempt‹ (last connected 5m0.098s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.101:26257: connect: connection refused"› [code 14/Unavailable]
E230621 06:48:44.029010 1399700 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n2] 137418  a panic has occurred!
E230621 06:48:44.029010 1399700 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n2] 137418 +child [‹3› ‹10.142.1.101:26257› ‹default›] already exists

@erikgrinaker erikgrinaker added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 17, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 17, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker added the branch-master Failures and bugs on the master branch. label Jul 17, 2023
@tbg
Copy link
Member Author

tbg commented Jul 27, 2023

I think the problem is here:

cockroach/pkg/rpc/peer.go

Lines 860 to 867 in caaa7d8

p.peers.mu.Lock()
delete(p.peers.mu.m, p.k)
p.peers.mu.Unlock()
p.mu.Lock()
p.mu.deleted = true
p.peerMetrics.release()
p.mu.Unlock()

The following can happen:

  1. we lock peers.mu, delete the peer, and unlock again.
  2. someone else calls grpcDialInternal which locks peers.mu and calls newPeer
  3. but we haven't released the metrics for the old peer yet! That only happens later
  4. crash, because now we're adding an identical child.

Should be easy peasy to fix.

craig bot pushed a commit that referenced this issue Jul 28, 2023
107391: roachtest: include link to testeng grafana in issue posts r=smg260,tbg a=annrpom

This adds a link, populated with relevant cluster name and test timeframe, to the testeng grafana instance for failed roachtests.

Fixes: #105894
Release note: None

107659: serverutils: provide SQLConn/SQLConnE in ApplicationLayerInterface r=stevendanna a=knz

Fixes  #107672.
Part of solving #107058.
Informs #106772.

Epic: CRDB-18499



107697: rpc: avoid crash in newPeer r=erikgrinaker a=tbg

It was previously possible to make a new peer while the old one was in the
middle of being deleted, which caused a crash due to to acquiring child metrics
when they still existed.

Luckily, this is easy enough to fix: just remove some premature optimization
where I had tried to be too clever.

Fixes #105335.

Epic: CRDB-21710
Release note: None (bug never released)

107721: asim: skip TestAllocatorSimulatorDeterministic and example_fulldisk r=wenyihu6 a=wenyihu6

We found some non-deterministic behavior in the allocator simulator (see #105904
for more details). For now, we are skipping these potentially flaky tests.

Release Note: None 
Epic: None

107728: persistedsqlstats: specify background qos for compaction job r=xinhaoz a=xinhaoz

The compaction job can be an expensive operation so we should de-prioritize it with the `UserLow` qos setting.

Fixes: #99949

Release note: None

107750: ui: fix app = empty string filter on stmts page r=xinhaoz a=xinhaoz

The filter on app name = empty string was not working on
the stmts page. This was due to the fact that we use (unset)
as the option in the filter to represent selecting the empty
string app name. However when filtering statements, the empty
string app name on the stmt was not changed accordingly.
this commit fixes this and also adds testing for the unset case.

Epic: none
Fixes: #107748

Release note (bug fix): Filter on stmts page works for
app name = empty string (represented as 'unset').

https://www.loom.com/share/2fee4f0fb7b04208803e0dac1d9694ab?sid=5cabecf9-1c2a-406b-89a8-b378ed07d329



107753: backupccl: deflake TestBackupAndRestoreJobDescription r=stevendanna a=adityamaru

This change sorts the jobs based on when they
were created to ensure we get a stable sort of
job descriptions.

Fixes: #107684
Release note: None

Co-authored-by: Annie Pompa <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in b26eef1 Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants