
roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed [replication span assertion; needs #107521] #107242

Closed
cockroach-teamcity opened this issue Jul 20, 2023 · 10 comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 20, 2023

roachtest.tpccbench/nodes=3/cpu=16/mt-shared-process failed with artifacts on master @ c6f4c0ed6e39ec4795755b8b477e6cac0abf818f:

(cluster.go:2180).Start: ~ ./cockroach.sh
cockroach service already active
To get more information: systemctl status cockroach: COMMAND_PROBLEM: exit status 1
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/tpccbench/nodes=3/cpu=16/mt-shared-process/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-29957

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Jul 20, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 20, 2023
@renatolabs
Contributor

Basic analysis:

08:18:13 tpcc.go:1492: --- SEARCH ITER FAIL: TPCC 3500 resulted in 0.0 tpmC and failed due to monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 134)

Over on n2:

E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 a panic has occurred!
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +use of Span after Finish. Span: ‹/cockroach.roachpb.Internal/Batch›. Finish previously called at: ‹<stack not captured. Set debugUseAfterFinish>›

Not sure if this is more on @cockroachdb/kv or @cockroachdb/multi-tenant.
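As a minimal, self-contained sketch of what this assertion guards against (hypothetical stand-in types, not the actual pkg/util/tracing API): one path finishes a span while another still holds it and later tries to derive a child span from it.

// Hypothetical model of the "use of Span after Finish" assertion; span is a
// stand-in for tracing.Span, not the real implementation.
package main

import "fmt"

type span struct {
	name     string
	finished bool
}

// Finish marks the span as done; any later use of it is a bug.
func (s *span) Finish() { s.finished = true }

// child models deriving a child span (as Tracer().StartSpanCtx with
// WithParent does); doing so after Finish trips the assertion.
func (s *span) child(name string) *span {
	if s.finished {
		panic(fmt.Sprintf("use of Span after Finish. Span: %s", s.name))
	}
	return &span{name: name}
}

func main() {
	parent := &span{name: "/cockroach.roachpb.Internal/Batch"}
	parent.Finish()                    // e.g. the proposer gives up and finishes its span...
	_ = parent.child("local proposal") // ...but entry application still derives a child: panic
}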

@knz
Contributor

knz commented Jul 21, 2023

@renatolabs could you help by including more context from the stack trace on n2? It will help us diagnose and route.

@tbg
Member

tbg commented Jul 24, 2023

It's on Repl. Reassigning.

Details
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +(1) attached stack trace
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  -- stack trace:
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | runtime.gopanic
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	GOROOT/src/runtime/panic.go:884
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | [...repeated from below...]
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +Wraps: (2) assertion failure
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +Wraps: (3) attached stack trace
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  -- stack trace:
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:154
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:878
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:670
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E230720 08:17:32.495120 672 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n2] 8214 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484

@erikgrinaker erikgrinaker added T-kv-replication and removed T-testeng TestEng Team labels Jul 24, 2023
@blathers-crl

blathers-crl bot commented Jul 24, 2023

cc @cockroachdb/replication

@tbg
Member

tbg commented Jul 24, 2023

Concretely, a log entry comes up for application. It's local, i.e. we have an entry in r.mu.proposals for it; that entry is deleted from the map and assigned to cmd.proposal instead. The command thus carries a cmd.proposal.ctx with a trace span in it. That span is already finished, so when we try to derive a child from it we trip the use-after-finish assertion, on the StartSpanCtx line below:

if cmd.IsLocal() {
	// We intentionally don't propagate the client's cancellation policy (in
	// cmd.ctx) onto the request. See #75656.
	propCtx := ctx // raft scheduler's ctx
	var propSp *tracing.Span
	// If the client has a trace, put a child into propCtx.
	if sp := tracing.SpanFromContext(cmd.proposal.ctx); sp != nil {
		propCtx, propSp = sp.Tracer().StartSpanCtx(
			propCtx, "local proposal", tracing.WithParent(sp),
		)
	}
	cmd.ctx, cmd.sp = propCtx, propSp

cmd.proposal.ctx is the caller's span (if no async consensus used) or it's a span derived from the caller's span (async consensus).

I think we're hitting the second path here. I say this because a) async consensus is common and b) the first path's handling of spans passes the smell check. Concretely, if the waiting caller wants to return early (i.e. before entry application), it calls the abandon closure below

abandon := func() {
	// The proposal may or may not be in the Replica's proposals map.
	// Instead of trying to look it up, simply modify the captured object
	// directly. The raftMu must be locked to modify the context of a
	// proposal because as soon as we propose a command to Raft, ownership
	// passes to the "below Raft" machinery.
	//
	// See the comment on ProposalData.
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	// TODO(radu): Should this context be created via tracer.ForkSpan?
	// We'd need to make sure the span is finished eventually.
	proposal.ctx = r.AnnotateCtx(context.TODO())
}

and since this acquires raftMu (which is held throughout application) and then wholesale replaces the proposal's ctx field with a dummy, the caller cannot return early while leaving its span accessible to entry application.

Update by @pavelkalinnikov:

The above paragraph/assumption is false, and that is actually the problem. With the new reproposals code (#105625), the abandon func above is sometimes effectively a no-op: it modifies the original proposal, while the reproposed copy of ProposalData still references the original context/span.
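To illustrate that hazard (a hypothetical sketch; proposalData stands in for ProposalData, and the assumption is simply that reproposals copy the struct by value as described above): replacing ctx on the original does nothing for a copy that was made earlier.

// Hypothetical sketch of the reproposal hazard: abandon() swaps the ctx only on
// the original proposal, so a previously created reproposed copy still carries
// the caller's (possibly already finished) context/span.
package main

import (
	"context"
	"fmt"
)

type proposalData struct {
	ctx context.Context // would carry the caller's trace span
}

type ctxKey struct{}

func main() {
	callerCtx := context.WithValue(context.Background(), ctxKey{}, "caller span")
	orig := &proposalData{ctx: callerCtx}

	// Reproposal: a copy of the proposal is made and handed to Raft again.
	reproposed := *orig

	// The caller returns early; abandon() replaces ctx, but only on the original.
	orig.ctx = context.Background()

	// Entry application later uses the reproposed copy, whose ctx still points
	// at the caller's span even though that span may already be finished.
	fmt.Println(reproposed.ctx.Value(ctxKey{})) // prints "caller span"
}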

The other case, however, does have a weakness. Async consensus creates an extra span that must be managed by the apply pipeline: cmd.proposal.sp (either nil, or the span stored in cmd.proposal.ctx), not to be confused with cmd.sp (which is always created in createTracingSpans, stored in cmd.ctx, and has a life of its own).

Long story short, I think the problem here is that there are cases where cmd.proposal.sp is finished before the corresponding entry comes up for application, but without also replacing cmd.proposal.ctx, concretely:

func (proposal *ProposalData) finishApplication(ctx context.Context, pr proposalResult) {
	proposal.ec.done(ctx, proposal.Request, pr.Reply, pr.Err)
	proposal.signalProposalResult(pr)
	if proposal.sp != nil {
		proposal.sp.Finish()
		proposal.sp = nil
	}
}

This is called, for example, here:

if p.command.MaxLeaseIndex <= r.mu.state.LeaseAppliedIndex {
	r.cleanupFailedProposalLocked(p)
	log.Eventf(p.ctx, "retry proposal %x: %s", p.idKey, reason)
	p.finishApplication(ctx, makeProposalResultErr(
		kvpb.NewAmbiguousResultErrorf(
			"unable to determine whether command was applied via snapshot",
		)))
}
continue

but there are many other callers. The point is that this method doesn't replace ctx, i.e. it doesn't mirror what the abandon closure does in the non-async-consensus case. If finishApplication is called and the entry then comes up for application, there'll be a stale cmd.proposal.ctx that blows up when the span within it is accessed.

edit: realizing that the caller of finishApplication I pointed out above does remove the proposal from the map (and holds the locks throughout, so there are no races with entry application). Either way, there are many callers, and it's going to be some race along those lines. Cleaning up and simplifying until we find the bug seems like the right course of action. We should also double-check that these assertions are not enabled in production builds.

@tbg tbg self-assigned this Jul 24, 2023
@tbg tbg added the A-kv-test-failure-complex A kv C-test-failure which requires a medium-large amount of work to address. label Jul 24, 2023
@tbg tbg changed the title roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed [replication span assertion] Jul 25, 2023
@tbg
Member

tbg commented Jul 25, 2023

Tracking the work needed to resolve this kind of issue in #107521. We can close this failure once that work is done (and we have found at least a theory of how this test failed).

Adding X-noreuse since the next time this test fails, it will more likely than not be with a different failure mode.

@tbg tbg changed the title roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed [replication span assertion] roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed [replication span assertion; needs #107521] Jul 25, 2023
@tbg tbg added the X-noreuse Prevent automatic commenting from CI test failures label Jul 25, 2023
@pav-kv
Collaborator

pav-kv commented Jul 27, 2023

There were lots of failed reproposals (found 297 in logs) prior to the panic, all on range r874:

...
I230720 08:17:32.654915 540 kv/kvserver/replica_application_result.go:219 ⋮ [T1,n1,s1,r874/1:‹/Tenant/2/Table/10{6/1…-7/1…}›,raft] 5850  failed to repropose 76692e14df124ecd at idx 187667 with new lease index: [NotLeaseHolderError] ‹reproposal failed due to closed timestamp›; r874: replica (n1,s1):1 not lease holder; current lease is repl=(n1,s1):1 seq=6 start=1689840878.394338902,0 epo=2 pro=1689840878.399903332,0
I230720 08:17:32.654979 540 kv/kvserver/replica_application_result.go:219 ⋮ [T1,n1,s1,r874/1:‹/Tenant/2/Table/10{6/1…-7/1…}›,raft] 5851  failed to repropose 65e2ac30d7e75f3c at idx 187676 with new lease index: [NotLeaseHolderError] ‹reproposal failed due to closed timestamp›; r874: replica (n1,s1):1 not lease holder; current lease is repl=(n1,s1):1 seq=6 start=1689840878.394338902,0 epo=2 pro=1689840878.399903332,0
E230720 08:17:32.707676 535 1@util/log/logcrash/crash_reporting.go:189 ⋮ [T1,n1] 5852  a panic has occurred!

There were 53 similar errors for ranges r864 and r865 on node 2 (at the same time, 08:17:32), but none on node 3.

Update: this is probably a coincidence or a symptom. In one of the reproduction runs, I saw no "failed to repropose" errors.

@irfansharif irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 28, 2023
@erikgrinaker
Contributor

Might be related to #101721 (comment).

@renatolabs
Contributor

This one as well: #107853.

@srosenberg
Member

srosenberg commented Aug 4, 2023

Tracked as a release blocker by #107521.
