Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: use of span after finish #108534

Closed
yuzefovich opened this issue Aug 10, 2023 · 7 comments · Fixed by #108775
Closed

kvserver: use of span after finish #108534

yuzefovich opened this issue Aug 10, 2023 · 7 comments · Fixed by #108775
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). X-duplicate Closed as a duplicate of another issue.

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Aug 10, 2023

On d9d421e when running ./dev test pkg/sql -f=TestRelocateNonVoters --stress --race --cpus=4 -- --test_env=COCKROACH_TEST_TENANT=true on the gceworker I got the following panic:

* ERROR: a panic has occurred!
* use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | › GOROOT/src/runtime/panic.go:890
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
*   | › github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType
*   | › github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:442
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose
*   | › github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:450
*   | github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog
*   | › github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92
*   | github.com/cockroachdb/cockroach/pkg/util/log.vEventf
*   | › github.com/cockroachdb/cockroach/pkg/util/log/trace.go:212
*   | github.com/cockroachdb/cockroach/pkg/util/log.VEvent
*   | › github.com/cockroachdb/cockroach/pkg/util/log/trace.go:231
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*propBuf).FlushLockedWithRaftGroup
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:601
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked.func2
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:775
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).withRaftGroupLocked
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1998
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:772
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:416
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
*   | › github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | › github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   | › src/runtime/asm_amd64.s:1594
* Wraps: (4) use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

This logging seems to be added recently in 60b26ae.

Stack trace is here

Jira issue: CRDB-30511

@yuzefovich yuzefovich added C-test-failure Broken test (automatically or manually discovered). T-kv-replication labels Aug 10, 2023
@yuzefovich
Copy link
Member Author

cc @cockroachdb/replication

1 similar comment
@blathers-crl
Copy link

blathers-crl bot commented Aug 10, 2023

cc @cockroachdb/replication

@yuzefovich
Copy link
Member Author

Another instance with the same repro steps

* use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:890
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
*   | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
*   | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:154
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:878
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:416
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   | 	src/runtime/asm_amd64.s:1594

@pav-kv pav-kv self-assigned this Aug 11, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 11, 2023

Dup of #107521. Thanks for the repro.

@pav-kv pav-kv added A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. X-duplicate Closed as a duplicate of another issue. labels Aug 11, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 11, 2023

Stressed this with COCKROACH_DEBUG_SPAN_USE_AFTER_FINISH=true, which makes this panic print the stack where Finish was called the first time. It is here:

reqSp.finish(br, redact)

I.e. the batch request returned, and closed its span, but the corresponding proposal pops up later with the same span (sometimes in proposal buffer, sometimes in raft application; see the two stacks quoted above).

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 12, 2023

In one repro run I see:

I230812 13:22:17.088626 7781 kv/kvserver/replica_proposal_buf.go:601  [T1,n5,s5,r2/3:/System/NodeLiveness{-Max}] 14007  flushing proposal to Raft
W230812 13:22:17.092620 6311 kv/kvserver/liveness/liveness.go:690  [T1,n4,liveness-hb] 14009  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.347s (given timeout 3s): result is ambiguous: context done during DistSender.Send: ba: ConditionalPut [/System/NodeLiveness/4,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/4], [txn: 34d1ac27], [can-forward-ts] RPC error: grpc: context deadline exceeded [code 4/DeadlineExceeded]

// <200ms later, in handleRaftReadyRaftMuLocked stack

E230812 13:22:17.218841 7781 1@util/log/logcrash/crash_reporting.go:189  [T1,n5] 14017  a panic has occurred!
E230812 13:22:17.218841 7781 1@util/log/logcrash/crash_reporting.go:189  [T1,n5] 14017 +use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: goroutine 99035 [running]:

Conjecture: a DistSender.Send gave up and canceled a slow RPC, making the receiver node Finish the corresponding span. However, the proposal was still in the receiver's pipeline, so a racing handleRaftReadyRaftMuLocked tried to use this proposal's context (for logging events), and saw it already Finish-ed.

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 14, 2023

With some extra logging, I can see the following sequence of events:

// Starting a node liveness transaction e6b2d71d.
I230812 20:04:50.685598 191965 server/node.go:1317  [T1,n4] 135817  starting request: ConditionalPut [/System/NodeLiveness/5,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts]

// 1 second later, there is a txn heartbeat.
I230812 20:04:51.662320 192756 server/node.go:1317  [T1,n4] 135964  starting request: HeartbeatTxn [/System/NodeLiveness/5,/Min), [txn: e6b2d71d]

// 3 seconds after starting the liveness transaction, it gets abandoned.
// (which also closes the tracing Span)
I230812 20:04:53.670457 191965 server/node.go:1314  [T1,n4] 136354  finishing request: ConditionalPut [/System/NodeLiveness/5,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts]

// And liveness ping sender logs the failed request.
W230812 20:04:53.758226 8060 kv/kvserver/liveness/liveness.go:774  [T1,n5,liveness-hb] 136368  slow heartbeat took 3.236503544s; err=result is ambiguous: context done during DistSender.Send: ba: ConditionalPut [/System/NodeLiveness/5,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts] RPC error: grpc: context deadline exceeded [code 4/DeadlineExceeded]

// The transaction is terminated.
I230812 20:04:53.760915 193021 server/node.go:1317  [T1,n4] 136370  starting request: EndTxn(abort) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts]
I230812 20:04:53.765921 192756 server/node.go:1314  [T1,n4] 136373  finishing request: HeartbeatTxn [/System/NodeLiveness/5,/Min), [txn: e6b2d71d]
W230812 20:04:53.819169 8060 kv/kvserver/liveness/liveness.go:690  [T1,n5,liveness-hb] 136377  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.299s (given timeout 3s): result is ambiguous: context done during DistSender.Send: ba: ConditionalPut [/System/NodeLiveness/5,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts] RPC error: grpc: context deadline exceeded [code 4/DeadlineExceeded]

// The original proposal (containing the transaction) pops up in raft,
// and the tracing span is used despite having been finished.
E230812 20:04:53.819675 6359 1@util/log/logcrash/crash_reporting.go:189  [T1,n4] 136382  a panic has occurred!
E230812 20:04:53.819675 6359 1@util/log/logcrash/crash_reporting.go:189  [T1,n4] 136382 +use of Span after Finish.
...
E230812 20:04:53.819675 6359 1@util/log/logcrash/crash_reporting.go:189  [T1,n4] 136382 +  | details: ConditionalPut [/System/NodeLiveness/5,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/5], [txn: e6b2d71d], [can-forward-ts]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). X-duplicate Closed as a duplicate of another issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants