sql/gcjob: make index GC robust to descriptors being deleted #86696
4ec7b76
to
f181f82
Compare
// Before deleting any indexes, ensure that old versions of the table descriptor
// are no longer in use. This is necessary in the case of truncate, where we
// schedule a GC Job in the transaction that commits the truncation.
parentDesc, err := sql.WaitToUpdateLeases(ctx, execCfg.LeaseManager, parentID)
if maybeHandleDeletedDescriptor(err) {
I feel like an easier-to-understand implementation is
if errors.Is(err, catalog.ErrDescriptorNotFound) {
	// If the descriptor has been removed, then we need to assume that the relevant
	// zone configs and data have been cleaned up by another process.
	handleDeletedDescriptor()
	return
}
and accordingly, change the other function to
handleDeletedDescriptor := func() {
	log.Infof(ctx, "descriptor %d dropped, assuming another process has handled GC", parentID)
	for _, index := range droppedIndexes {
		markIndexGCed(
			ctx, index.IndexID, progress, jobspb.SchemaChangeGCProgress_CLEARED,
		)
	}
}
But I don't really have a strong opinion on this; either way is perfectly fine with me.
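For readers following the suggestion above, here is a minimal, self-contained sketch of the same pattern: check the error against a sentinel with `errors.Is`, and if the descriptor is gone, mark every dropped index as cleared and stop. All names in it (`errDescriptorNotFound`, `status`, `lookupDescriptor`, and this simplified `gcIndexes`) are hypothetical stand-ins, not the CockroachDB types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// errDescriptorNotFound is a hypothetical stand-in for catalog.ErrDescriptorNotFound.
var errDescriptorNotFound = errors.New("descriptor not found")

// status is a hypothetical stand-in for the per-index GC progress status.
type status int

const (
	waiting status = iota
	cleared
)

// lookupDescriptor is a hypothetical lookup that wraps the sentinel error
// when the descriptor no longer exists.
func lookupDescriptor(exists bool, id int) error {
	if !exists {
		return fmt.Errorf("looking up descriptor %d: %w", id, errDescriptorNotFound)
	}
	return nil
}

// gcIndexes returns done=true when the descriptor is gone. In that case it
// assumes another process cleaned up the data and marks every index cleared.
func gcIndexes(exists bool, parentID int, progress map[int]status) (done bool, err error) {
	if err := lookupDescriptor(exists, parentID); err != nil {
		if errors.Is(err, errDescriptorNotFound) {
			for id := range progress {
				progress[id] = cleared
			}
			return true, nil
		}
		return false, err
	}
	return false, nil
}

func main() {
	progress := map[int]status{2: waiting, 3: waiting}
	done, err := gcIndexes(false, 104, progress)
	fmt.Println(done, err, progress[2] == cleared, progress[3] == cleared)
	// prints "true <nil> true true"
}
```

Because the sentinel is matched with `errors.Is`, the check still works when the lookup wraps the error with extra context.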
@@ -84,11 +102,28 @@ func gcIndexes(
if log.V(2) {
	log.Infof(ctx, "GC is being considered on table %d for indexes indexes: %+v", parentID, droppedIndexes)
}
maybeHandleDeletedDescriptor := func(err error) (done bool) {
This is reused, so maybe consider making it a standalone function?
pkg/sql/gcjob_test/gc_job_test.go
Outdated
jobID := <-gcJobID
go func() {
	k := catalogkeys.MakeDescMetadataKey(codec, tableID)
	_, err := kvDB.Del(ctx, k)
	errCh <- err
}()
Is the ordering of those two steps reversed? The `DROP INDEX` stmt above will ensure we stall the index-gc job (due to the testing knob). We want to test the scenario where the index-gc job failed to find the table descriptor, so shouldn't we drop the table first before we allow the index-gc job to proceed?
There's no guarantee of ordering between the two. I don't think it changes the test to say `go` first, but I'll do it to clarify intention.
I still didn't understand this one: gcJobID is a 0-capacity channel, so `jobID := <-gcJobID` will unblock the `DROP INDEX` first, and then we schedule a goroutine that will delete the descriptor from the `system.descriptor` table, which means the `DROP INDEX` gc-job may or may not observe the deletion.
Shouldn't we make sure that we deleted the descriptor before we "let go" of the `DROP INDEX` index-gc job, which ensures that the gc-index job will observe the deletion?
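The ordering concern raised here can be illustrated with a small, self-contained sketch. It is not the test in this PR: `store`, `runJob`, and `deleteThenRelease` are hypothetical, and the job is released by closing a channel rather than by receiving a job ID. It only shows that deleting synchronously before releasing the blocked goroutine guarantees that the goroutine observes the deletion.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the system.descriptor table.
type store struct {
	mu    sync.Mutex
	descs map[int]bool
}

func (s *store) del(id int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.descs, id)
}

func (s *store) exists(id int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.descs[id]
}

// runJob starts a hypothetical GC job that blocks until release is closed,
// then reports whether it still saw the descriptor.
func runJob(s *store, id int, release <-chan struct{}) <-chan bool {
	sawDesc := make(chan bool, 1)
	go func() {
		<-release
		sawDesc <- s.exists(id)
	}()
	return sawDesc
}

// deleteThenRelease deletes the descriptor synchronously and only then
// unblocks the job, so the job cannot run its lookup before the delete.
func deleteThenRelease(s *store, id int) bool {
	release := make(chan struct{})
	saw := runJob(s, id, release)
	s.del(id)
	close(release)
	return <-saw
}

func main() {
	s := &store{descs: map[int]bool{104: true}}
	fmt.Println(deleteThenRelease(s, 104)) // prints "false"
}
```

If the delete were instead started in its own goroutine after the release, the result would depend on scheduling, which is the nondeterminism the question points at.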
The job has a bunch of work to do before it gets to the point of checking whether or not the descriptor is there. I guess what I'd say is if I switched the order, it would probably not affect the test much. Both statements more or less just make a goroutine runnable. I promise this test failed before this patch.
If this test were better, it'd intercept behavior of the gc job and delete the descriptor at specific moments, but I didn't feel that it was worth it, so I just ran the test on stress for a few minutes and called it a day.
I made the test better.
aef7663
to
0c38b19
Compare
@ajwerner I didn't forget about this one; let me know once you've added more commentary on the test case, and I'll take a look again.
If the descriptor was deleted, the GC job should exit gracefully.
Fixes cockroachdb#86340
Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed.
0c38b19
to
ed2e090
Compare
@Xiang-Gu I added more commentary, PTAL
// the DeleteRange operation. To do this, we install the below testing knob.
if !beforeDelRange {
	knobs.Store = &kvserver.StoreTestingKnobs{
		TestingRequestFilter: func(
When `beforeDelRange=false`, this knob is installed, but it is called before evaluating the `deleteRange` request, which means we delete the descriptor before evaluating the `deleteRange`. This seems not to achieve what you said in the comment above: "... descriptor being removed both before the initial DelRange, and after, when going to remove the zone config".
Correct, it's before evaluating `DeleteRange`, but after the code which looked up the descriptor in order to build the `DeleteRange`.
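A rough sketch of the timing described here, using hypothetical types rather than the real `kvserver.StoreTestingKnobs` API: a hook runs just before a request is evaluated, so a deletion performed inside the hook lands after the job has already looked up the descriptor to build the request, but before any later lookup such as the one for the zone config step.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a KV request.
type request struct{ method string }

// kv is a hypothetical in-memory store with a pre-evaluation hook, loosely
// modeled on the idea of a testing request filter.
type kv struct {
	descs  map[int]bool
	filter func(request) // runs before a request is evaluated
	events []string
}

// send runs the filter (if any) and then records the request as evaluated.
func (k *kv) send(r request) {
	if k.filter != nil {
		k.filter(r)
	}
	k.events = append(k.events, "evaluated "+r.method)
}

// deleteOnFirstDelRange returns a filter that removes the descriptor the
// first time it sees a DeleteRange request.
func deleteOnFirstDelRange(k *kv, id int) func(request) {
	fired := false
	return func(r request) {
		if r.method == "DeleteRange" && !fired {
			fired = true
			delete(k.descs, id)
			k.events = append(k.events, "descriptor deleted")
		}
	}
}

// gcJob looks up the descriptor, issues DeleteRange, then looks the
// descriptor up again before the zone config step.
func gcJob(k *kv, id int) []string {
	if !k.descs[id] {
		k.events = append(k.events, "descriptor missing before DeleteRange")
		return k.events
	}
	k.events = append(k.events, "descriptor found")
	k.send(request{method: "DeleteRange"})
	if !k.descs[id] {
		k.events = append(k.events, "descriptor missing before zone config removal")
	}
	return k.events
}

func main() {
	k := &kv{descs: map[int]bool{104: true}}
	k.filter = deleteOnFirstDelRange(k, 104)
	for _, e := range gcJob(k, 104) {
		fmt.Println(e)
	}
}
```

In this model the first lookup succeeds and only the second one misses, which corresponds to the "after the initial DelRange" case discussed in the comment.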
Got ya, thanks for explaining, LGTM!
TFTR! bors r+
Build failed (retrying...):
Build failed:
bors r+
Build failed (retrying...):
Build failed:
bors r+
Build failed (retrying...):
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from ed2e090 to blathers/backport-release-22.1-86696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
First commit is #86690
If the descriptor was deleted, the GC job should exit gracefully.
Fixes #86340
Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed.