sql: fix bug where bad mutation job state could block dropping tables #57836
Ultimately I went for what seemed to be the simplest fix, which is to keep marking these jobs as succeeded whenever possible. I also realized that this was also broken if we have |
Reviewed 1 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @lucy-zhang)
pkg/sql/drop_table.go, line 470 at r1 (raw file):
// a failed state. Such jobs could have already been GCed from the jobs
// table by the time this code runs.
row, err := p.ExtendedEvalContext().InternalExecutor.(*InternalExecutor).QueryRowEx(
Is this really better than loading the job and reading its status using the jobs APIs?
87e13f2 to 0be5c57
Also added a test for when we're dropping a table with mutation jobs in the reverting state.
Reviewed 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/drop_table.go, line 470 at r1 (raw file):
Previously, ajwerner wrote…
Is this really better than loading the job and reading its status using the jobs APIs?
That's a fair point. I updated it to use `UpdateJobWithTxn`. I also decided I didn't want to overwrite the original error message in the case of jobs that need to be marked as failed.
Does this need to go back to 20.1 as well?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @lucy-zhang)
pkg/sql/schema_changer_test.go, line 6157 at r2 (raw file):
// Try to create a unique index which won't be valid and will need a rollback.
_, err = sqlDB.Exec(`CREATE UNIQUE INDEX i ON t.test(v);`)
require.Regexp(t, "violates unique constraint", err)
Use `assert` in a goroutine.
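For context on this suggestion: testify's `require` helpers stop the test by calling `t.FailNow`, and Go's `testing` documentation says `FailNow` must be called from the goroutine running the test, because it ends only the calling goroutine (via `runtime.Goexit`). The `assert` helpers record the failure and return. The sketch below illustrates that difference with the standard library only; `recorder` and `runInGoroutine` are hypothetical stand-ins invented for this example and are not part of the PR or of testify.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// recorder is a hypothetical stand-in for *testing.T that only tracks failure.
type recorder struct {
	mu     sync.Mutex
	failed bool
}

// Fail records the failure and returns, like the non-fatal path assert uses.
func (r *recorder) Fail() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.failed = true
}

// FailNow records the failure and then exits the calling goroutine only,
// which is how testing.T.FailNow behaves (it calls runtime.Goexit).
func (r *recorder) FailNow() {
	r.Fail()
	runtime.Goexit()
}

// runInGoroutine runs check on a new goroutine and reports whether the
// code placed after the check was still executed.
func runInGoroutine(check func()) (reachedEnd bool) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done() // deferred calls still run after runtime.Goexit
		check()
		reachedEnd = true
	}()
	wg.Wait()
	return reachedEnd
}

func main() {
	r := &recorder{}
	// Non-fatal check: the goroutine runs to completion.
	fmt.Println("after Fail, reached end:", runInGoroutine(r.Fail))
	// Fatal check: the goroutine exits early, while the caller keeps running.
	fmt.Println("after FailNow, reached end:", runInGoroutine(r.FailNow))
}
```

In a real test the caller that "keeps running" is the test function itself, so a fatal check inside a spawned goroutine does not stop the test the way it would on the test goroutine.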
pkg/sql/schema_changer_test.go, line 6187 at r2 (raw file):
ctx := context.Background()
runTest := func(params base.TestServerArgs, gcJobRecord bool) {
pass in the testing.T
0be5c57 to 6505359
I added a query before calling `JobRegistry.UpdateJobWithTxn` to check that the job exists at all (which I don't think is possible through the registry API) and to not return an error in that case. Should be fixed now.
Does this need to go back to 20.1 as well?
Probably. I'll figure that out.
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/schema_changer_test.go, line 6157 at r2 (raw file):
Previously, ajwerner wrote…
Use `assert` in a goroutine.
Done.
pkg/sql/schema_changer_test.go, line 6187 at r2 (raw file):
Previously, ajwerner wrote…
pass in the `testing.T`
Done.
Reviewable status: complete! 1 of 0 LGTMs obtained
TFTR! bors r+
I'm realizing now that this is not going to backport cleanly to 20.2 because bors r-
Canceled.
6505359 to e2d9090
RFAL
Previously, while dropping a table, we would mark all the jobs associated with mutations on the table as `succeeded`, under the assumption that they were running. The job registry API prohibits this when the jobs are not `running` (or `pending`), so if a mutation was stuck on the table descriptor with a failed or nonexistent job, dropping the table would fail. This PR fixes the bug by checking the job state before attempting to update the job. It also fixes a related failure to drop a table caused by a valid mutation job not being in a `running` state. Release note (bug fix): Fixed a bug where prior schema changes on a table that failed and could not be fully reverted could prevent the table from being dropped.
e2d9090 to 94be025
Reviewed 2 of 2 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained
TFTR bors r+
Build succeeded:
Previously, while dropping a table, we would mark all the jobs associated with mutations on the table as `succeeded`, under the assumption that they were running. The job registry API prohibits this when the jobs are not `running` (or `pending`), so if a mutation was stuck on the table descriptor with a failed or nonexistent job, dropping the table would fail.

This PR fixes the bug by checking the job state before attempting to update the job. It also fixes a related failure to drop a table caused by a valid mutation job not being in a `running` state.

Fixes #57597.

Release note (bug fix): Fixed a bug where prior schema changes on a table that failed and could not be fully reverted could prevent the table from being dropped.
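
The state check described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in `pkg/sql/drop_table.go`: the `Status` values, the `Job` struct, the in-memory `registry` map, and `finishForDrop` are all hypothetical names made up for the example. The handling of non-terminal, non-running jobs (marking them failed while keeping their original error) is one reading of the review discussion rather than a confirmed description of the final code.

```go
package main

import "fmt"

// Status is a simplified, hypothetical job status for this sketch.
type Status string

const (
	StatusPending   Status = "pending"
	StatusRunning   Status = "running"
	StatusReverting Status = "reverting"
	StatusFailed    Status = "failed"
	StatusSucceeded Status = "succeeded"
)

// Job is a hypothetical, minimal job record.
type Job struct {
	ID     int64
	Status Status
	Err    string // original error message, if any
}

// registry is a hypothetical in-memory stand-in for the jobs table.
type registry map[int64]*Job

// finishForDrop settles the mutation jobs of a table that is being dropped.
// It inspects each job's state first instead of assuming it is running.
func (r registry) finishForDrop(jobIDs []int64) {
	for _, id := range jobIDs {
		job, ok := r[id]
		if !ok {
			// The job record may already have been GCed; this is not an error.
			continue
		}
		switch job.Status {
		case StatusRunning, StatusPending:
			// Only these states may legally move to succeeded.
			job.Status = StatusSucceeded
		case StatusSucceeded, StatusFailed:
			// Already terminal; leave the record untouched.
		default:
			// For example reverting: the job can no longer succeed, so mark it
			// failed without overwriting its original error (assumed behavior).
			job.Status = StatusFailed
		}
	}
}

func main() {
	r := registry{
		1: &Job{ID: 1, Status: StatusRunning},
		2: &Job{ID: 2, Status: StatusReverting, Err: "violates unique constraint"},
		3: &Job{ID: 3, Status: StatusFailed, Err: "earlier failure"},
	}
	ids := []int64{1, 2, 3, 4} // job 4 has no record, as if it had been GCed
	r.finishForDrop(ids)
	for _, id := range ids {
		if job, ok := r[id]; ok {
			fmt.Printf("job %d: %s (err=%q)\n", job.ID, job.Status, job.Err)
		} else {
			fmt.Printf("job %d: no record, skipped\n", id)
		}
	}
}
```

The point of the sketch is that the drop path never returns an error because of a job's state: missing and terminal jobs are skipped, and every other job is moved to a state it is allowed to reach.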