Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix bug where bad mutation job state could block dropping tables #57836

Merged

Conversation

thoszhang
Copy link
Contributor

Previously, while dropping a table, we would mark all the jobs
associated with mutations on the table as succeeded, under the
assumption that they were running. The job registry API prohibits this
when the jobs are not running (or pending), so if a mutation was
stuck on the table descriptor with a failed or nonexistent job, dropping
the table would fail.

This PR fixes the bug by checking the job state before attempting to
update the job. It also fixes a related failure to drop a table caused
by a valid mutation job not being in a running state.

Fixes #57597.

Release note (bug fix): Fixed a bug where prior schema changes on a
table that failed and could not be fully reverted could prevent the
table from being dropped.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@thoszhang
Copy link
Contributor Author

Ultimately I went for what seemed to be the simplest fix, which is to keep marking these jobs as succeeded whenever possible. I also realized that this was also broken if we have reverting mutation jobs. That could use a separate test.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @lucy-zhang)


pkg/sql/drop_table.go, line 470 at r1 (raw file):

		// a failed state. Such jobs could have already been GCed from the jobs
		// table by the time this code runs.
		row, err := p.ExtendedEvalContext().InternalExecutor.(*InternalExecutor).QueryRowEx(

Is this really better than loading the job and reading its status using the jobs APIs?

@thoszhang thoszhang force-pushed the fix-marking-jobs-succeeded-in-drop-table branch 2 times, most recently from 87e13f2 to 0be5c57 Compare December 22, 2020 05:59
Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a test for when we're dropping a table with mutation jobs in the reverting state.

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/drop_table.go, line 470 at r1 (raw file):

Previously, ajwerner wrote…

Is this really better than loading the job and reading its status using the jobs APIs?

That's a fair point. I updated it to use UpdateJobWithTxn. I also decided I didn't want to overwrite the original error message in the case of jobs that need to be marked as failed.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Does this need to go back to 20.1 as well?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lucy-zhang)


pkg/sql/schema_changer_test.go, line 6157 at r2 (raw file):

		// Try to create a unique index which won't be valid and will need a rollback.
		_, err = sqlDB.Exec(`CREATE UNIQUE INDEX i ON t.test(v);`)
		require.Regexp(t, "violates unique constraint", err)

Use assert in a goroutine.


pkg/sql/schema_changer_test.go, line 6187 at r2 (raw file):

	ctx := context.Background()

	runTest := func(params base.TestServerArgs, gcJobRecord bool) {

pass in the testing.T

@thoszhang thoszhang force-pushed the fix-marking-jobs-succeeded-in-drop-table branch from 0be5c57 to 6505359 Compare December 23, 2020 20:04
Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a query before calling JobRegistry.UpdateJobWithTxn to check that the job exists at all (which I don't think is possible through the registry API) and to not return an error in that case. Should be fixed now.

Does this need to go back to 20.1 as well?

Probably. I'll figure that out.

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/sql/schema_changer_test.go, line 6157 at r2 (raw file):

Previously, ajwerner wrote…

Use assert in a goroutine.

Done.


pkg/sql/schema_changer_test.go, line 6187 at r2 (raw file):

Previously, ajwerner wrote…

pass in the testing.T

Done.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@thoszhang
Copy link
Contributor Author

TFTR!

bors r+

@thoszhang
Copy link
Contributor Author

I'm realizing now that this is not going to backport cleanly to 20.2 because UpdateJobWithTxn is only on master, and doesn't quite do what I want anyway. So I'm going to rework this slightly and just load the job itself.

bors r-

@craig
Copy link
Contributor

craig bot commented Dec 23, 2020

Canceled.

@thoszhang thoszhang force-pushed the fix-marking-jobs-succeeded-in-drop-table branch from 6505359 to e2d9090 Compare December 23, 2020 20:40
@thoszhang
Copy link
Contributor Author

RFAL

Previously, while dropping a table, we would mark all the jobs
associated with mutations on the table as `succeeded`, under the
assumption that they were running. The job registry API prohibits this
when the jobs are not `running` (or `pending`), so if a mutation was
stuck on the table descriptor with a failed or nonexistent job, dropping
the table would fail.

This PR fixes the bug by checking the job state before attempting to
update the job. It also fixes a related failure to drop a table caused
by a valid mutation job not being in a `running` state.

Release note (bug fix): Fixed a bug where prior schema changes on a
table that failed and could not be fully reverted could prevent the
table from being dropped.
@thoszhang thoszhang force-pushed the fix-marking-jobs-succeeded-in-drop-table branch from e2d9090 to 94be025 Compare December 23, 2020 23:31
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@thoszhang
Copy link
Contributor Author

TFTR

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 24, 2020

Build succeeded:

@craig craig bot merged commit 80d24ed into cockroachdb:master Dec 24, 2020
@thoszhang thoszhang deleted the fix-marking-jobs-succeeded-in-drop-table branch January 7, 2021 19:42
@rafiss rafiss added this to the 20.2 milestone Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: tables with abandoned schema change mutations with failed jobs cannot be dropped
4 participants