Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/gcjob: make index GC robust to descriptors being deleted #86696

Merged
merged 1 commit into from
Sep 9, 2022

Conversation

ajwerner
Copy link
Contributor

First commit is #86690

If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the DROP INDEX gc job could get caught retrying indefinitely. This
has been fixed.

@ajwerner ajwerner requested a review from a team August 23, 2022 19:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/gcjob-index-robustness branch 2 times, most recently from 4ec7b76 to f181f82 Compare August 23, 2022 21:11
// Before deleting any indexes, ensure that old versions of the table descriptor
// are no longer in use. This is necessary in the case of truncate, where we
// schedule a GC Job in the transaction that commits the truncation.
parentDesc, err := sql.WaitToUpdateLeases(ctx, execCfg.LeaseManager, parentID)
if maybeHandleDeletedDescriptor(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like a eaiser to understand implementation is

		if errors.Is(err, catalog.ErrDescriptorNotFound) {
		  // If the descriptor has been removed, then we need to assume that the relevant
		  // zone configs and data have been cleaned up by another process.
			handleDeletedDescriptor()
      return
		}

and accordingly, change the other function to

	handleDeletedDescriptor := func() {
		log.Infof(ctx, "descriptor %d dropped, assuming another process has handled GC", parentID)
		for _, index := range droppedIndexes {
			markIndexGCed(
				ctx, index.IndexID, progress, jobspb.SchemaChangeGCProgress_CLEARED,
			)
		}
		return true
	}

But I don't really have a strong opinion on this; either way is perfectly fine with me.

@@ -84,11 +102,28 @@ func gcIndexes(
if log.V(2) {
log.Infof(ctx, "GC is being considered on table %d for indexes indexes: %+v", parentID, droppedIndexes)
}
maybeHandleDeletedDescriptor := func(err error) (done bool) {
Copy link
Contributor

@Xiang-Gu Xiang-Gu Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reused so maybe consider making it a standalone function?

Comment on lines 548 to 553
jobID := <-gcJobID
go func() {
k := catalogkeys.MakeDescMetadataKey(codec, tableID)
_, err := kvDB.Del(ctx, k)
errCh <- err
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ordering of those two steps reversed?
The DROP INDEX stmt above will ensure we stall the index-gc job (due to the testing knob). We want to test the scenario where the the index-gc job failed to find the table descriptor, so shouldn't we drop the table first before we allow the index-gc job to proceed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no guarantee of ordering between the two. I don't think it changes the test to say go first, but I'll do it to clarify intention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still didn't understand this one: gcJobID is a 0-capacity channel. so jobID := <- jcJobID will unblock the DROP INDEX first and then we schedule a go routine that will delete the descriptor from system.descriptor table, which makes the DROP INDEX gc-job may or may not observe the deletion.

Shouldn't we make sure that we deleted the descriptor before we "let go" of the DROP INDEX index-gc job, which ensures that the gc-index job will observe the deletion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job has a bunch of work to do before it gets to the point of checking whether or not the descriptor is there. I guess what I'd say is if I switched the order, it would probably not affect the test much. Both statements more or less just make a goroutine runnable. I promise this test failed before this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this test were better, it'd intercept behavior of the gc job and delete the descriptor at specific moments, but I didn't feel that it was worth it, so I just ran the test on stress for a few minutes and called it a day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the test better.

@ajwerner ajwerner force-pushed the ajwerner/gcjob-index-robustness branch 2 times, most recently from aef7663 to 0c38b19 Compare September 1, 2022 21:42
@Xiang-Gu
Copy link
Contributor

Xiang-Gu commented Sep 6, 2022

@ajwerner I didn't forget about this one; Let me know once you added more commentary on the test case, and I'll take a look again.

If the descriptor was deleted, the GC job should exit gracefully.

Fixes cockroachdb#86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.
@ajwerner ajwerner force-pushed the ajwerner/gcjob-index-robustness branch from 0c38b19 to ed2e090 Compare September 7, 2022 20:51
@ajwerner
Copy link
Contributor Author

ajwerner commented Sep 7, 2022

@Xiang-Gu I added more commentary, PTAL

// the DeleteRange operation. To do this, we install the below testing knob.
if !beforeDelRange {
knobs.Store = &kvserver.StoreTestingKnobs{
TestingRequestFilter: func(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When beforeDelRange=false, this knob is installed but this knob is called before evaluating the deleteRange request, which means we delete the descriptor before evaluating the deleteRange. This seems to not achieve what you said in the comment above "... descriptor being removed both before the initial DelRange, and after, when going to remove the zone config".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it's before evaluating DeleteRange, but after the code which looked up the descriptor in order to build the DeleteRange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got ya, thanks for explaining, LGTM!

// the DeleteRange operation. To do this, we install the below testing knob.
if !beforeDelRange {
knobs.Store = &kvserver.StoreTestingKnobs{
TestingRequestFilter: func(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got ya, thanks for explaining, LGTM!

@ajwerner
Copy link
Contributor Author

ajwerner commented Sep 8, 2022

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 8, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 8, 2022

Build failed:

@ajwerner
Copy link
Contributor Author

ajwerner commented Sep 9, 2022

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 9, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 9, 2022

Build failed:

@ajwerner
Copy link
Contributor Author

ajwerner commented Sep 9, 2022

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 9, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 9, 2022

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Sep 9, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ed2e090 to blathers/backport-release-22.1-86696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gcjob: index GC can retry perpetually when racing with table GC
3 participants