
kvclient: v20.2.2: transaction unexpectedly committed #57552

Closed
cockroach-teamcity opened this issue Dec 4, 2020 · 6 comments

Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-kv KV Team

Comments

@cockroach-teamcity (Member)

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/2071814981/?referrer=webhooks_plugin

Panic message:

txn_coord_sender.go:823: log.Fatal: transaction unexpectedly committed: ×. ba: ×. txn: meta={id=9b6dfc5b pri=0.00922168 epo=0 ts=1607066927.789567810,0 min=1607066927.789567810,0 seq=4} lock=true stat=COMMITTED rts=0,0 wto=false max=0,0 int=3.
--
*errutil.leafError: log.Fatal: transaction unexpectedly committed: ×. ba: ×. txn: meta={id=9b6dfc5b pri=0.00922168 epo=0 ts=1607066927.789567810,0 min=1607066927.789567810,0 seq=4} lock=true stat=COMMITTED rts=0,0 wto=false max=0,0 int=3. (1)
txn_coord_sender.go:823: *withstack.withStack (top exception)
(check the extra data payloads)

Stacktrace (with inline code snippets):

// committed.
log.Fatalf(ctx, "transaction unexpectedly committed: %s. ba: %s. txn: %s.", pErr, ba, errTxn)
}
in pkg/kv/kvclient/kvcoord.sanityCheckCommittedErr
if errTxn.Status == roachpb.COMMITTED {
sanityCheckCommittedErr(ctx, pErr, ba)
}
in pkg/kv/kvclient/kvcoord.(*TxnCoordSender).updateStateLocked
pErr = tc.updateStateLocked(ctx, ba, br, pErr)
in pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send

cockroach/pkg/kv/db.go

Lines 743 to 745 in 92d9495

tracing.AnnotateTrace()
br, pErr := sender.Send(ctx, ba)
if pErr != nil {
in pkg/kv.(*DB).sendUsingSender

cockroach/pkg/kv/txn.go

Lines 918 to 920 in 92d9495

txn.mu.Unlock()
br, pErr := txn.db.sendUsingSender(ctx, ba, sender)
if pErr == nil {
in pkg/kv.(*Txn).Send

cockroach/pkg/kv/db.go

Lines 653 to 655 in 92d9495

ba.Header = b.Header
b.response, b.pErr = send(ctx, ba)
b.fillResults(ctx)
in pkg/kv.sendAndFill

cockroach/pkg/kv/txn.go

Lines 589 to 591 in 92d9495

}
return sendAndFill(ctx, txn.Send, b)
}
in pkg/kv.(*Txn).Run

cockroach/pkg/kv/txn.go

Lines 651 to 653 in 92d9495

b.initResult(1 /* calls */, 0, b.raw, nil)
return txn.Run(ctx, b)
}
in pkg/kv.(*Txn).CommitInBatch
// coordinator.
err = tb.txn.CommitInBatch(ctx, tb.b)
} else {
in pkg/sql.(*tableWriterBase).finalize

cockroach/pkg/sql/upsert.go

Lines 123 to 125 in 92d9495

if lastBatch {
if err := n.run.tw.finalize(params.ctx); err != nil {
return false, err
in pkg/sql.(*upsertNode).BatchedNext
for {
if next, err := r.source.BatchedNext(params); !next {
return err
in pkg/sql.(*rowCountNode).startExec

cockroach/pkg/sql/plan.go

Lines 513 to 515 in 92d9495

}
return n.startExec(params)
},
in pkg/sql.startExec.func2

cockroach/pkg/sql/walk.go

Lines 118 to 120 in 92d9495

}
v.err = v.observer.leaveNode(name, plan)
}()
in pkg/sql.(*planVisitor).visitInternal.func1

cockroach/pkg/sql/walk.go

Lines 297 to 299 in 92d9495

}
}
in pkg/sql.(*planVisitor).visitInternal
}
v.visitInternal(plan, name)
return plan
in pkg/sql.(*planVisitor).visit
v := makePlanVisitor(ctx, observer)
v.visit(plan)
return v.err
in pkg/sql.walkPlan

cockroach/pkg/sql/plan.go

Lines 516 to 518 in 92d9495

}
return walkPlan(params.ctx, plan, o)
}
in pkg/sql.startExec
// This starts all of the nodes below this node.
if err := startExec(p.params, p.node); err != nil {
p.MoveToDraining(err)
in pkg/sql.(*planNodeToRowSource).Start
}
ctx = pb.self.Start(ctx)
Run(ctx, pb.self, pb.Out.output)
in pkg/sql/execinfra.(*ProcessorBase).Run
}
headProc.Run(ctx)
return nil
in pkg/sql/flowinfra.(*FlowBase).Run
// TODO(radu): this should go through the flow scheduler.
if err := flow.Run(ctx, func() {}); err != nil {
log.Fatalf(ctx, "unexpected error from syncFlow.Start(): %v\n"+
in pkg/sql.(*DistSQLPlanner).Run
recv.expectedRowsRead = int64(physPlan.TotalEstimatedScannedRows)
return dsp.Run(planCtx, txn, physPlan, recv, evalCtx, nil /* finishedSetupFn */)
}
in pkg/sql.(*DistSQLPlanner).PlanAndRun
// the planner whether or not to plan remote table readers.
cleanup := ex.server.cfg.DistSQLPlanner.PlanAndRun(
ctx, evalCtx, planCtx, planner.txn, planner.curPlan.main, recv,
in pkg/sql.(*connExecutor).execWithDistSQLEngine
ex.sessionTracing.TraceExecStart(ctx, "distributed")
stats, err := ex.execWithDistSQLEngine(
ctx, planner, stmt.AST.StatementType(), res, distributePlan.WillDistribute(), progAtomic,
in pkg/sql.(*connExecutor).dispatchToExecutionEngine
p.autoCommit = os.ImplicitTxn.Get() && !ex.server.cfg.TestingKnobs.DisableAutoCommit
if err := ex.dispatchToExecutionEngine(ctx, p, res); err != nil {
return nil, nil, err
in pkg/sql.(*connExecutor).execStmtInOpenState
} else {
ev, payload, err = ex.execStmtInOpenState(ctx, stmt, res, pinfo)
}
in pkg/sql.(*connExecutor).execStmt
if !portal.exhausted {
ev, payload, err = ex.execStmt(stmtCtx, curStmt, stmtRes, pinfo)
// Portal suspension is supported via a "side" state machine
in pkg/sql.(*connExecutor).execPortal
res = stmtRes
ev, payload, err = ex.execPortal(ctx, portal, portalName, stmtRes, pinfo)
return err
in pkg/sql.(*connExecutor).execCmd.func2
return err
}()
// Note: we write to ex.statsCollector.phaseTimes, instead of ex.phaseTimes,
in pkg/sql.(*connExecutor).execCmd
var err error
if err = ex.execCmd(ex.Ctx()); err != nil {
if errors.IsAny(err, io.EOF, errDrainingComplete) {
in pkg/sql.(*connExecutor).run
}()
return h.ex.run(ctx, s.pool, reserved, cancel)
}
in pkg/sql.(*Server).ServeConn
reservedOwned = false // We're about to pass ownership away.
retErr = sqlServer.ServeConn(ctx, connHandler, reserved, cancelConn)
}()
in pkg/sql/pgwire.(*conn).processCommandsAsync.func1

pkg/kv/kvclient/kvcoord/txn_coord_sender.go in pkg/kv/kvclient/kvcoord.sanityCheckCommittedErr at line 823
pkg/kv/kvclient/kvcoord/txn_coord_sender.go in pkg/kv/kvclient/kvcoord.(*TxnCoordSender).updateStateLocked at line 796
pkg/kv/kvclient/kvcoord/txn_coord_sender.go in pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send at line 502
pkg/kv/db.go in pkg/kv.(*DB).sendUsingSender at line 744
pkg/kv/txn.go in pkg/kv.(*Txn).Send at line 919
pkg/kv/db.go in pkg/kv.sendAndFill at line 654
pkg/kv/txn.go in pkg/kv.(*Txn).Run at line 590
pkg/kv/txn.go in pkg/kv.(*Txn).CommitInBatch at line 652
pkg/sql/tablewriter.go in pkg/sql.(*tableWriterBase).finalize at line 149
pkg/sql/upsert.go in pkg/sql.(*upsertNode).BatchedNext at line 124
pkg/sql/plan_batch.go in pkg/sql.(*rowCountNode).startExec at line 159
pkg/sql/plan.go in pkg/sql.startExec.func2 at line 514
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal.func1 at line 119
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal at line 298
pkg/sql/walk.go in pkg/sql.(*planVisitor).visit at line 86
pkg/sql/walk.go in pkg/sql.walkPlan at line 50
pkg/sql/plan.go in pkg/sql.startExec at line 517
pkg/sql/plan_node_to_row_source.go in pkg/sql.(*planNodeToRowSource).Start at line 125
pkg/sql/execinfra/processorsbase.go in pkg/sql/execinfra.(*ProcessorBase).Run at line 774
pkg/sql/flowinfra/flow.go in pkg/sql/flowinfra.(*FlowBase).Run at line 392
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).Run at line 422
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).PlanAndRun at line 1002
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execWithDistSQLEngine at line 1001
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).dispatchToExecutionEngine at line 872
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmtInOpenState at line 639
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmt at line 114
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execPortal at line 203
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd.func2 at line 1533
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd at line 1535
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).run at line 1391
pkg/sql/conn_executor.go in pkg/sql.(*Server).ServeConn at line 508
pkg/sql/pgwire/conn.go in pkg/sql/pgwire.(*conn).processCommandsAsync.func1 at line 626
Tag                Value
Cockroach Release  v20.2.2
Cockroach SHA      92d9495
Platform           linux amd64
Distribution       CCL
Environment        v20.2.2
Command            server
Go Version
# of CPUs
# of Goroutines
@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Dec 4, 2020
@yuzefovich yuzefovich changed the title sentry: txn_coord_sender.go:823: log.Fatal: transaction unexpectedly committed: ×. ba: ×. txn: meta={id=9b6dfc5b pri=0.00922168 epo=0 ts=1607066927.789567810,0 min=1607066927.789567810,0 seq=4} lock=true stat=COMMITTED rts=0,0 wto=false max=0,0 int=3. -- *errutil.leafError: log.Fatal: transaction unexpectedly committed: ×. ba: ×. txn: meta={id=9b6dfc5b pri=0.00922168 epo=0 ts=1607066927.789567810,0 min=1607066927.789567810,0 seq=4} lock=true stat=COMMITTED rts=0,0 wto=false max=0,0 int=3. (1) txn_coord_sender.go:823: *withstack.withStack (top exception) (check the extra data payloads) kvclient: v20.2.2: transaction unexpectedly committed Dec 4, 2020
@yuzefovich (Member)

cc @nvanbenschoten

@nvanbenschoten (Member)

There is some history behind this assertion. It was added in #47997, removed in #48227, and then re-added in #48297. It doesn't look like we got the assertion quite right, as people have hit it a few times on v20.2.x.

The unfortunate part about this is that both the error and the batch request are too redacted to meaningfully understand. @andreimatei do you have thoughts about what we can do here? Should we at least improve the redaction so we'll have a better idea of what's going wrong next time we see this?
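For reference, the redaction mechanics work roughly as follows. This is a minimal standalone sketch using the github.com/cockroachdb/redact package; the values and the wrapping here are hypothetical, not the actual assertion code. Arguments wrapped with redact.Safe survive redaction, while everything else collapses to the "×" marker seen in the Sentry report above.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/redact"
)

func main() {
	// Hypothetical stand-ins for the values in the assertion; in the
	// v20.2 report both the error and the batch were fully redacted.
	baSummary := "1 Put, 1 EndTxn(commit=true)"
	errDetail := "result is ambiguous"

	// baSummary is marked safe and survives redaction; errDetail is not,
	// so Redact() replaces it with the redaction marker.
	s := redact.Sprintf("transaction unexpectedly committed: %s. ba: %s.",
		errDetail, redact.Safe(baSummary))
	fmt.Println(s.Redact())
}
```

Marking a summary of the batch as safe in this way is one option for making the next occurrence debuggable.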

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@yuzefovich (Member) commented Jul 15, 2021

Just a ping on this issue: we have seen a few more reports of it in the wild.

@andreimatei (Contributor)

Oh man, I don't know what's going on. The assertion says that an error is informing the client that the txn it was operating on is committed. That's not supposed to happen (if the txn committed, who's still operating on it?), except for some edge cases that are explicitly handled.
All the reports so far are for 20.2. In 21.1, the batch in the assertion is no longer completely redacted, and I'd be very curious to see that batch. I say we just wait for a 21.1 report.
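For readers following along, the shape of the check can be reconstructed from the snippets in the stack trace. This is a simplified sketch against the v20.2 internal types, not the actual txn_coord_sender.go code, which carves out the explicitly handled edge cases before fataling:

```go
package kvcoord

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// sanityCheckCommittedErrSketch is a simplified reconstruction from the
// trace above; the real sanityCheckCommittedErr first excludes the edge
// cases that legitimately carry a COMMITTED transaction in an error.
func sanityCheckCommittedErrSketch(
	ctx context.Context, pErr *roachpb.Error, ba roachpb.BatchRequest,
) {
	errTxn := pErr.GetTxn()
	if errTxn == nil || errTxn.Status != roachpb.COMMITTED {
		return // nothing suspicious: the error does not carry a committed txn
	}
	// An error response is telling the client that the transaction it is
	// still operating on has already committed. Outside the handled edge
	// cases this violates a coordinator invariant, so the node fatals.
	log.Fatalf(ctx, "transaction unexpectedly committed: %s. ba: %s. txn: %s.",
		pErr, ba, errTxn)
}
```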

@andreimatei (Contributor)

We know a little bit about what's in the batch because of the stack traces - they all come from inserts and updates doing CommitInBatch. This is surprising - the request trying to commit gets an error and also commits?
What I really want to see is the error - and that too is supposed to be less redacted in 21.1.
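For context on why that is surprising: the frames show the table writer finalizing through the kv client's commit-in-batch fast path, where the writes and the commit travel together. A hedged sketch of that pattern against the internal kv API (the function and parameters here are hypothetical illustrations, not the actual tablewriter code):

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
)

// upsertAndCommit illustrates the write path in the trace: the statement's
// writes and the commit are sent as one batch, so an error from this call
// should mean the commit did not happen. The assertion fires when the error
// instead reports the transaction as COMMITTED.
func upsertAndCommit(ctx context.Context, txn *kv.Txn, key, value interface{}) error {
	b := txn.NewBatch()
	b.Put(key, value)
	// CommitInBatch attaches the EndTxn(commit=true) to the same batch,
	// which then flows through Txn.Run -> sendAndFill -> Txn.Send ->
	// TxnCoordSender.Send, matching the frames above.
	return txn.CommitInBatch(ctx, b)
}
```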

@knz (Contributor) commented Aug 12, 2021

Closing in favor of #67765

@knz knz closed this as completed Aug 12, 2021