Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v22.1.1: sql: alter table attempted to update job for mutation 2, but job already exists with mutation 1 #82921

Closed
cockroach-teamcity opened this issue Jun 15, 2022 · 6 comments · Fixed by #92300
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 15, 2022

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3350717600/?referrer=webhooks_plugin

Panic message:

table.go:202: attempted to update job for mutation 2, but job already exists with mutation 1
(1)
Wraps: (2) assertion failure
Wraps: (3) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/sql.(*planner).createOrUpdateSchemaChangeJob
| github.com/cockroachdb/cockroach/pkg/sql/table.go:202
| github.com/cockroachdb/cockroach/pkg/sql.(*planner).writeSchemaChange
| github.com/cockroachdb/cockroach/pkg/sql/table.go:248
| github.com/cockroachdb/cockroach/pkg/sql.(*planner).dropIndexByName
| github.com/cockroachdb/cockroach/pkg/sql/drop_index.go:554
| github.com/cockroachdb/cockroach/pkg/sql.(*dropIndexNode).startExec
| github.com/cockroachdb/cockroach/pkg/sql/drop_index.go:162
| github.com/cockroachdb/cockroach/pkg/sql.startExec.func2
| github.com/cockroachdb/cockroach/pkg/sql/plan.go:516
| github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal.func1
| github.com/cockroachdb/cockroach/pkg/sql/walk.go:112
| github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal
| github.com/cockroachdb/cockroach/pkg/sql/walk.go:297
| github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visit
| github.com/cockroachdb/cockroach/pkg/sql/walk.go:79
| github.com/cockroachdb/cockroach/pkg/sql.walkPlan
| github.com/cockroachdb/cockroach/pkg/sql/walk.go:43
| github.com/cockroachdb/cockroach/pkg/sql.startExec
| github.com/cockroachdb/cockroach/pkg/sql/plan.go:519
| github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start
| github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:146
| github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).Init
| github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:158
| github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).Init
| github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:90
| github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init.func1
| github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:235
| github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError
| github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91
| github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init
| github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:234
| github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).Run
| github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:279
| github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Run
| github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:260
| github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
| github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:597
| github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun
| github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1445
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1467
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1143
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:686
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:143
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPortal
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:231
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func2
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1952
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1954
| github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1800
| github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn
| github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:748
| github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
| github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:724
| runtime.goexit
| src/runtime/asm_amd64.s:1581
Wraps: (4) attempted to update job for mutation 2, but job already exists with mutation 1
Error types: (1) *colexecerror.notInternalError (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
-- report composition:
*errutil.leafError: attempted to update job for mutation 2, but job already exists with mutation 1
table.go:202: *withstack.withStack (top exception)
*assert.withAssertionFailure
*colexecerror.notInternalError

Stacktrace (expand for inline code snippets):

cockroach/pkg/sql/table.go

Lines 201 to 203 in 242d13b

if mutationID != descpb.InvalidMutationID && mutationID != oldDetails.TableMutationID {
return errors.AssertionFailedf(
"attempted to update job for mutation %d, but job already exists with mutation %d",
in pkg/sql.(*planner).createOrUpdateSchemaChangeJob

cockroach/pkg/sql/table.go

Lines 247 to 249 in 242d13b

if !tableDesc.IsNew() {
if err := p.createOrUpdateSchemaChangeJob(ctx, tableDesc, jobDesc, mutationID); err != nil {
return err
in pkg/sql.(*planner).writeSchemaChange
mutationID := tableDesc.ClusterVersion().NextMutationID
if err := p.writeSchemaChange(ctx, tableDesc, mutationID, jobDesc); err != nil {
return err
in pkg/sql.(*planner).dropIndexByName
// different index than the one being dropped.
if err := params.p.dropIndexByName(
ctx, index.tn, index.idxName, tableDesc, n.n.IfExists, n.n.DropBehavior, checkIdxConstraint,
in pkg/sql.(*dropIndexNode).startExec

cockroach/pkg/sql/plan.go

Lines 515 to 517 in 242d13b

}
return n.startExec(params)
},
in pkg/sql.startExec.func2

cockroach/pkg/sql/walk.go

Lines 111 to 113 in 242d13b

}
v.err = v.observer.leaveNode(name, plan)
}()
in pkg/sql.(*planVisitor).visitInternal.func1

cockroach/pkg/sql/walk.go

Lines 296 to 298 in 242d13b

}
}
in pkg/sql.(*planVisitor).visitInternal
}
v.visitInternal(plan, name)
return plan
in pkg/sql.(*planVisitor).visit
v := makePlanVisitor(ctx, observer)
v.visit(plan)
return v.err
in pkg/sql.walkPlan

cockroach/pkg/sql/plan.go

Lines 518 to 520 in 242d13b

}
return walkPlan(params.ctx, plan, o)
}
in pkg/sql.startExec
// This starts all of the nodes below this node.
if err := startExec(p.params, p.node); err != nil {
p.MoveToDraining(err)
in pkg/sql.(*planNodeToRowSource).Start
ctx = c.StartInternalNoSpan(ctx)
c.input.Start(ctx)
if execStatsHijacker, ok := c.input.(execinfra.ExecStatsForTraceHijacker); ok {
in pkg/sql/colexec.(*Columnarizer).Init
func (bic *batchInfoCollector) Init(ctx context.Context) {
bic.Operator.Init(ctx)
bic.mu.Lock()
in pkg/sql/colflow.(*batchInfoCollector).Init
return colexecerror.CatchVectorizedRuntimeError(func() {
f.input.Root.Init(ctx)
})
in pkg/sql/colflow.(*BatchFlowCoordinator).init.func1
}()
operation()
return retErr
in pkg/sql/colexecerror.CatchVectorizedRuntimeError
func (f *BatchFlowCoordinator) init(ctx context.Context) error {
return colexecerror.CatchVectorizedRuntimeError(func() {
f.input.Root.Init(ctx)
in pkg/sql/colflow.(*BatchFlowCoordinator).init
if err := f.init(ctx); err != nil {
f.pushError(err)
in pkg/sql/colflow.(*BatchFlowCoordinator).Run
log.VEvent(ctx, 1, "running the batch flow coordinator in the flow's goroutine")
f.batchFlowCoordinator.Run(ctx)
}
in pkg/sql/colflow.(*vectorizedFlow).Run
// TODO(radu): this should go through the flow scheduler.
flow.Run(ctx, func() {})
in pkg/sql.(*DistSQLPlanner).Run
recv.expectedRowsRead = int64(physPlan.TotalEstimatedScannedRows)
runCleanup := dsp.Run(ctx, planCtx, txn, physPlan, recv, evalCtx, nil /* finishedSetupFn */)
return func() {
in pkg/sql.(*DistSQLPlanner).PlanAndRun
// the planner whether or not to plan remote table readers.
cleanup := ex.server.cfg.DistSQLPlanner.PlanAndRun(
ctx, evalCtx, planCtx, planner.txn, planner.curPlan.main, recv,
in pkg/sql.(*connExecutor).execWithDistSQLEngine
ex.sessionTracing.TraceExecStart(ctx, "distributed")
stats, err := ex.execWithDistSQLEngine(
ctx, planner, stmt.AST.StatementReturnType(), res, distribute, progAtomic,
in pkg/sql.(*connExecutor).dispatchToExecutionEngine
if err := ex.dispatchToExecutionEngine(stmtCtx, p, res); err != nil {
stmtThresholdSpan.Finish()
in pkg/sql.(*connExecutor).execStmtInOpenState
} else {
ev, payload, err = ex.execStmtInOpenState(ctx, parserStmt, prepared, pinfo, res, canAutoCommit)
}
in pkg/sql.(*connExecutor).execStmt
}
ev, payload, err = ex.execStmt(ctx, portal.Stmt.Statement, portal.Stmt, pinfo, stmtRes, canAutoCommit)
// Portal suspension is supported via a "side" state machine
in pkg/sql.(*connExecutor).execPortal
canAutoCommit := ex.implicitTxn() && tcmd.FollowedBySync
ev, payload, err = ex.execPortal(ctx, portal, portalName, stmtRes, pinfo, canAutoCommit)
return err
in pkg/sql.(*connExecutor).execCmd.func2
return err
}()
// Note: we write to ex.statsCollector.phaseTimes, instead of ex.phaseTimes,
in pkg/sql.(*connExecutor).execCmd
var err error
if err = ex.execCmd(); err != nil {
if errors.IsAny(err, io.EOF, errDrainingComplete) {
in pkg/sql.(*connExecutor).run
}()
return h.ex.run(ctx, s.pool, reserved, cancel)
}
in pkg/sql.(*Server).ServeConn
reservedOwned = false // We're about to pass ownership away.
retErr = sqlServer.ServeConn(ctx, connHandler, reserved, cancelConn)
}()
in pkg/sql/pgwire.(*conn).processCommandsAsync.func1
src/runtime/asm_amd64.s#L1580-L1582 in runtime.goexit

pkg/sql/table.go in pkg/sql.(*planner).createOrUpdateSchemaChangeJob at line 202
pkg/sql/table.go in pkg/sql.(*planner).writeSchemaChange at line 248
pkg/sql/drop_index.go in pkg/sql.(*planner).dropIndexByName at line 554
pkg/sql/drop_index.go in pkg/sql.(*dropIndexNode).startExec at line 162
pkg/sql/plan.go in pkg/sql.startExec.func2 at line 516
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal.func1 at line 112
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal at line 297
pkg/sql/walk.go in pkg/sql.(*planVisitor).visit at line 79
pkg/sql/walk.go in pkg/sql.walkPlan at line 43
pkg/sql/plan.go in pkg/sql.startExec at line 519
pkg/sql/plan_node_to_row_source.go in pkg/sql.(*planNodeToRowSource).Start at line 146
pkg/sql/colexec/columnarizer.go in pkg/sql/colexec.(*Columnarizer).Init at line 158
pkg/sql/colflow/stats.go in pkg/sql/colflow.(*batchInfoCollector).Init at line 90
pkg/sql/colflow/flow_coordinator.go in pkg/sql/colflow.(*BatchFlowCoordinator).init.func1 at line 235
pkg/sql/colexecerror/error.go in pkg/sql/colexecerror.CatchVectorizedRuntimeError at line 91
pkg/sql/colflow/flow_coordinator.go in pkg/sql/colflow.(*BatchFlowCoordinator).init at line 234
pkg/sql/colflow/flow_coordinator.go in pkg/sql/colflow.(*BatchFlowCoordinator).Run at line 279
pkg/sql/colflow/vectorized_flow.go in pkg/sql/colflow.(*vectorizedFlow).Run at line 260
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).Run at line 597
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).PlanAndRun at line 1445
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execWithDistSQLEngine at line 1467
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).dispatchToExecutionEngine at line 1143
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmtInOpenState at line 686
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmt at line 143
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execPortal at line 231
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd.func2 at line 1952
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd at line 1954
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).run at line 1800
pkg/sql/conn_executor.go in pkg/sql.(*Server).ServeConn at line 748
pkg/sql/pgwire/conn.go in pkg/sql/pgwire.(*conn).processCommandsAsync.func1 at line 724
src/runtime/asm_amd64.s in runtime.goexit at line 1581
Tag Value
Cockroach Release v22.1.1
Cockroach SHA: 242d13b
Platform windows amd64
Distribution CCL
Environment v22.1.1
Command demo
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-16731

@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Jun 15, 2022
@i-e-b
Copy link

i-e-b commented Jun 15, 2022

This was during an attempt to run an automated integration test against a cockroach demo instance.

Running under Windows, from a powershell session:

./cockroach demo --disable-demo-license --no-example-database --insecure --sql-port 26258 --http-port 5020

The query being run was a large set of CREATE TABLE calls. I can provide the scripts privately if useful (note to self: credits DB setup)

@i-e-b
Copy link

i-e-b commented Jun 15, 2022

Smallest repro case I've got so far.
Start the 'demo' server using the command line above.
Run this statement as a Npgsql command:

CREATE TABLE "User"
(
    "UserID"              integer primary key
);

Then immediately run these statements as a Npgsql command:

ALTER TABLE "User" ADD "SponsorID" INT NULL;
CREATE INDEX User_SponsorID_IDX ON "User" ("SponsorID");

That seems to trigger the fault reliably.
If the CREATE INDEX ... statement is not run, the fault is not triggered.

If you run these statements interactively through the cockroach sql client, the fault is not triggered

@i-e-b
Copy link

i-e-b commented Jun 15, 2022

Running each of the statements as individual Npgsql commands does not trigger the issue.

i.e.

CREATE TABLE "User"
(
    "UserID"              integer primary key
);

then

ALTER TABLE "User" ADD "SponsorID" INT NULL;

then

CREATE INDEX User_SponsorID_IDX ON "User" ("SponsorID");

with no delays between.

This is a viable work-around for me right now.

@yuzefovich yuzefovich changed the title sentry: table.go:202: attempted to update job for mutation 2, but job already exists with mutation 1 (1) Wraps: (2) assertion failure Wraps: (3) attached stack trace -- stack trace: | github.com/cockroach... v22.1.1: sql: alter table attempted to update job for mutation 2, but job already exists with mutation 1 Jun 29, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 29, 2022
@yuzefovich
Copy link
Member

Seems like a dup of #82046, but here a user reached with a repro. cc @postamar

@postamar
Copy link
Contributor

Thanks for the repro! This definitely helps.

@ajwerner
Copy link
Contributor

The root cause here is this code:

// This is a huge kludge to deal with the fact that we're resolving types
// using a planner with a committed transaction. This ends up being almost
// okay because the execution is going to re-acquire leases on these types.
// Regardless, holding this lease is worse than not holding it. Users might
// expect to get type mismatch errors if a rename of the type occurred.
if ex.getTransactionState() == NoTxnStateStr {
ex.planner.Descriptors().ReleaseAll(ctx)
}

The code is wrong in two ways:

  1. We're never in execBind here in the real noTxnState since sql: use an implicit txn during the extended protocol #76792
  2. The getTransactionState() method confusingly return NoTxnStateStr when in an Implicit transaction.
    • I don't think anybody wants this, it's very old behavior.

The tl;dr is that that code block has fully rotted. It even knew it was a mistake when it was added.

craig bot pushed a commit that referenced this issue Nov 22, 2022
92300: sql: fix bug with multi-statement implicit txn schema changes and Bind r=ajwerner a=ajwerner

For legacy reasons, we were resetting the descriptor collection state in Bind if we thought we were not in a transaction. Since #76792, we're always in a transaction. You might think that'd mean that the logic would not run. Sadly, for other still unclear reasons, when in an implicit transaction `(*connExecutor).getTransactionState()` returns `NoTxnStateStr`. The end result was that we'd erroneously reset our descriptor state in the middle of a multi- statement implicit transaction if bind was invoked.

Fixes #82921

Release note (bug fix): Fixed a bug which could lead to errors when running multiple schema change statements in a single command using a driver that uses the extended pgwire protocol internally (Npgsql in .Net as an example). These errors would have the form "attempted to update job for mutation 2, but job already exists with mutation 1".

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in d0f16e5 Nov 22, 2022
blathers-crl bot pushed a commit that referenced this issue Nov 22, 2022
For legacy reasons, we were resetting the descriptor collection state in Bind
if we thought we were not in a transaction. Since #76792, we're always in a
transaction. You might think that'd mean that the logic would not run. Sadly,
for other still unclear reasons, when in an implicit transaction
`(*connExecutor).getTransactionState()` returns `NoTxnStateStr`. The end result
was that we'd erroneously reset our descriptor state in the middle of a multi-
statement implicit transaction if bind was invoked.

Fixes #82921

Release note (bug fix): Fixed a bug which could lead to errors when running
multiple schema change statements in a single command using a driver that
uses the extended pgwire protocol internally (Npgsql in .Net as an example).
These errors would have the form "attempted to update job for mutation 2, but
job already exists with mutation 1".
blathers-crl bot pushed a commit that referenced this issue Nov 22, 2022
For legacy reasons, we were resetting the descriptor collection state in Bind
if we thought we were not in a transaction. Since #76792, we're always in a
transaction. You might think that'd mean that the logic would not run. Sadly,
for other still unclear reasons, when in an implicit transaction
`(*connExecutor).getTransactionState()` returns `NoTxnStateStr`. The end result
was that we'd erroneously reset our descriptor state in the middle of a multi-
statement implicit transaction if bind was invoked.

Fixes #82921

Release note (bug fix): Fixed a bug which could lead to errors when running
multiple schema change statements in a single command using a driver that
uses the extended pgwire protocol internally (Npgsql in .Net as an example).
These errors would have the form "attempted to update job for mutation 2, but
job already exists with mutation 1".
ajwerner added a commit that referenced this issue Dec 14, 2022
For legacy reasons, we were resetting the descriptor collection state in Bind
if we thought we were not in a transaction. Since #76792, we're always in a
transaction. You might think that'd mean that the logic would not run. Sadly,
for other still unclear reasons, when in an implicit transaction
`(*connExecutor).getTransactionState()` returns `NoTxnStateStr`. The end result
was that we'd erroneously reset our descriptor state in the middle of a multi-
statement implicit transaction if bind was invoked.

Fixes #82921

Release note (bug fix): Fixed a bug which could lead to errors when running
multiple schema change statements in a single command using a driver that
uses the extended pgwire protocol internally (Npgsql in .Net as an example).
These errors would have the form "attempted to update job for mutation 2, but
job already exists with mutation 1".
ajwerner added a commit that referenced this issue Dec 14, 2022
For legacy reasons, we were resetting the descriptor collection state in Bind
if we thought we were not in a transaction. Since #76792, we're always in a
transaction. You might think that'd mean that the logic would not run. Sadly,
for other still unclear reasons, when in an implicit transaction
`(*connExecutor).getTransactionState()` returns `NoTxnStateStr`. The end result
was that we'd erroneously reset our descriptor state in the middle of a multi-
statement implicit transaction if bind was invoked.

Fixes #82921

Release note (bug fix): Fixed a bug which could lead to errors when running
multiple schema change statements in a single command using a driver that
uses the extended pgwire protocol internally (Npgsql in .Net as an example).
These errors would have the form "attempted to update job for mutation 2, but
job already exists with mutation 1".
ajwerner added a commit that referenced this issue Dec 14, 2022
ajwerner added a commit that referenced this issue Dec 14, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants