-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: ensure descriptors versions change only once, fix related bugs #79697
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is good. Seems like some tests' expectations need adjusting.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
This turns out to be too simplistic. Added #79701 to take some pressure off. |
…ange It's a bug any time we increment a version more than once in the context of a single descriptor collection. Let's enforce it. Release note: None
2faa07b
to
8f5f077
Compare
40a9157
to
8828763
Compare
…database As soon as cockroachdb#79580 merged, it tickled flakes. These flakes were caused by operations to alter the database which would build new mutable tables from immutable tables and thus overwrite existing entries. The fix here is to retrieve the proper descriptors using the collection, and to make sure that those descriptors are properly hydrated. This, in turn, revealed that our policy checking in validation to avoid using descriptors which had been already checked out was too severe and would cause excessive numbers of extra round-trip. I'll be honest, I haven't fully internalized why that policy was there. I supposed it was there to ensure that we didn't have a case where we check out a descriptor and then fail to write it to the store. I don't exactly know what to do to re-establish the desired behavior of the code. At the very least, it feels like if we did re-read the descriptor once, then we should do something about resetting the status. I guess I could try to unravel the mystery leading to the checkout in the first place. The test is very flakey without this patch. Release note: None
I don't have evidence that these were causing bugs, but they might have been. Whenever we're doing a schema change, we need to avoid leasing. All of this hints at a severe abstraction problem. It shouldn't be possible to get a leased descriptor in the context of a schema change. One day we'll have moved all of these to the declarative schema changer and we'll stop talking about bugs like this. Release note: None
8828763
to
5f762cf
Compare
I'm not sure I love this, but I'm also not sure I hate it, and it's RFAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, 7 of 7 files at r3, 8 of 8 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @msbutler)
-- commits, line 27 at r3:
That policy is there to ensure that we don't validate using mutable cross-referenced descriptors which might be in any kind of indeterminate state. What matters is what's persisted in storage, and a descriptor which has been checked out at least once cannot be assumed to be in the same in-memory state as in storage, nor can we know for sure if it's in its final state such as it will be written to storage when the transaction commits.
I agree that we can relax the policy somewhat, let's just be careful.
-- commits, line 41 at r4:
I completely agree with this whole paragraph.
pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):
ctx context.Context, id descpb.ID, ) (catalog.Descriptor, error) { if uc, _ := c.tc.uncommitted.getImmutableByID(id); uc != nil {
As discussed offline, let's leave this check in, but unmark the uncommitted descriptor as "checked out" as soon as it's persisted to storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @msbutler, and @postamar)
pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):
Previously, postamar (Marius Posta) wrote…
As discussed offline, let's leave this check in, but unmark the uncommitted descriptor as "checked out" as soon as it's persisted to storage.
The problem runs deeper! If you look at the cases I had to change, i had to change them to defer adding the tables to the collection too early. In those cases, we were adding the descriptors to a batch, and then validation was reading the descriptors from disk. When validation read those descriptors from disk, it was the old version (because the batch had not been sent). In some ways, if you want to change a graph of descriptors, it means you have to read them all and then write them all. If we marked them here, as we discussed, we'd still need the changes you suggest. Is that better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @msbutler)
pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):
Previously, ajwerner wrote…
The problem runs deeper! If you look at the cases I had to change, i had to change them to defer adding the tables to the collection too early. In those cases, we were adding the descriptors to a batch, and then validation was reading the descriptors from disk. When validation read those descriptors from disk, it was the old version (because the batch had not been sent). In some ways, if you want to change a graph of descriptors, it means you have to read them all and then write them all. If we marked them here, as we discussed, we'd still need the changes you suggest. Is that better?
Ok, so removing this check is the least worst alternative for the time being, in a sense. Let's go with that.
I'm leaving this be. bors r+ |
Build succeeded: |
The first commit adds back the change from #79580.
The second commit fixes fallout. As soon as #79580 merged, it tickled flakes. These flakes were caused by
operations to alter the database which would build new mutable tables from
immutable tables and thus overwrite existing entries. The fix here is to
retrieve the proper descriptors using the collection, and to make sure
that those descriptors are properly hydrated. This, in turn, revealed that
our policy checking in validation to avoid using descriptors which had been
already checked out was too severe and would cause excessive numbers of extra
round-trip.
I'll be honest, I haven't fully internalized why that policy was there. I
supposed it was there to ensure that we didn't have a case where we check
out a descriptor and then fail to write it to the store. I don't exactly
know what to do to re-establish the desired behavior of the code. At the
very least, it feels like if we did re-read the descriptor once, then we
should do something about resetting the status. I guess I could try to
unravel the mystery leading to the checkout in the first place.
The test is very flakey without this patch.
The third commit sets
AvoidLeased
flags which ought to be set. It's sadthat they ought to be set and that it's possible that there were bugs due to
the flag not being set.
Backport will address #79704.
Release note: None