Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: ensure descriptors versions change only once, fix related bugs #79697

Merged
merged 3 commits into from
Apr 13, 2022

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Apr 8, 2022

The first commit adds back the change from #79580.

The second commit fixes fallout. As soon as #79580 merged, it tickled flakes. These flakes were caused by
operations to alter the database which would build new mutable tables from
immutable tables and thus overwrite existing entries. The fix here is to
retrieve the proper descriptors using the collection, and to make sure
that those descriptors are properly hydrated. This, in turn, revealed that
our policy checking in validation to avoid using descriptors which had been
already checked out was too severe and would cause excessive numbers of extra
round-trip.

I'll be honest, I haven't fully internalized why that policy was there. I
supposed it was there to ensure that we didn't have a case where we check
out a descriptor and then fail to write it to the store. I don't exactly
know what to do to re-establish the desired behavior of the code. At the
very least, it feels like if we did re-read the descriptor once, then we
should do something about resetting the status. I guess I could try to
unravel the mystery leading to the checkout in the first place.

The test is very flakey without this patch.

The third commit sets AvoidLeased flags which ought to be set. It's sad
that they ought to be set and that it's possible that there were bugs due to
the flag not being set.

Backport will address #79704.

Release note: None

@ajwerner ajwerner requested a review from a team April 8, 2022 21:07
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

This change is good. Seems like some tests' expectations need adjusting.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@ajwerner
Copy link
Contributor Author

ajwerner commented Apr 8, 2022

This turns out to be too simplistic. Added #79701 to take some pressure off.

@ajwerner ajwerner marked this pull request as draft April 8, 2022 21:59
…ange

It's a bug any time we increment a version more than once in the context
of a single descriptor collection. Let's enforce it.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/fix-new-flake branch from 2faa07b to 8f5f077 Compare April 11, 2022 04:24
@ajwerner ajwerner changed the title sql: fix flake in schemachange.TestWorkload sql: ensure descriptors versions change only once, fix related bugs Apr 11, 2022
@ajwerner ajwerner force-pushed the ajwerner/fix-new-flake branch 2 times, most recently from 40a9157 to 8828763 Compare April 11, 2022 06:36
ajwerner and others added 2 commits April 11, 2022 17:44
…database

As soon as cockroachdb#79580 merged, it tickled flakes. These flakes were caused by
operations to alter the database which would build new mutable tables from
immutable tables and thus overwrite existing entries. The fix here is to
retrieve the proper descriptors using the collection, and to make sure
that those descriptors are properly hydrated. This, in turn, revealed that
our policy checking in validation to avoid using descriptors which had been
already checked out was too severe and would cause excessive numbers of extra
round-trip.

I'll be honest, I haven't fully internalized why that policy was there. I
supposed it was there to ensure that we didn't have a case where we check
out a descriptor and then fail to write it to the store. I don't exactly
know what to do to re-establish the desired behavior of the code. At the
very least, it feels like if we did re-read the descriptor once, then we
should do something about resetting the status. I guess I could try to
unravel the mystery leading to the checkout in the first place.

The test is very flakey without this patch.

Release note: None
I don't have evidence that these were causing bugs, but they might
have been. Whenever we're doing a schema change, we need to avoid leasing.
All of this hints at a severe abstraction problem. It shouldn't be possible to
get a leased descriptor in the context of a schema change. One day we'll have
moved all of these to the declarative schema changer and we'll stop talking
about bugs like this.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/fix-new-flake branch from 8828763 to 5f762cf Compare April 11, 2022 21:45
@ajwerner ajwerner marked this pull request as ready for review April 12, 2022 13:43
@ajwerner ajwerner requested review from a team and shermanCRL and removed request for a team April 12, 2022 13:43
@ajwerner
Copy link
Contributor Author

I'm not sure I love this, but I'm also not sure I hate it, and it's RFAL.

@shermanCRL shermanCRL requested review from msbutler and removed request for shermanCRL April 12, 2022 13:49
Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, 7 of 7 files at r3, 8 of 8 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @msbutler)


-- commits, line 27 at r3:
That policy is there to ensure that we don't validate using mutable cross-referenced descriptors which might be in any kind of indeterminate state. What matters is what's persisted in storage, and a descriptor which has been checked out at least once cannot be assumed to be in the same in-memory state as in storage, nor can we know for sure if it's in its final state such as it will be written to storage when the transaction commits.

I agree that we can relax the policy somewhat, let's just be careful.


-- commits, line 41 at r4:
I completely agree with this whole paragraph.


pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):

	ctx context.Context, id descpb.ID,
) (catalog.Descriptor, error) {
	if uc, _ := c.tc.uncommitted.getImmutableByID(id); uc != nil {

As discussed offline, let's leave this check in, but unmark the uncommitted descriptor as "checked out" as soon as it's persisted to storage.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @msbutler, and @postamar)


pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):

Previously, postamar (Marius Posta) wrote…

As discussed offline, let's leave this check in, but unmark the uncommitted descriptor as "checked out" as soon as it's persisted to storage.

The problem runs deeper! If you look at the cases I had to change, i had to change them to defer adding the tables to the collection too early. In those cases, we were adding the descriptors to a batch, and then validation was reading the descriptors from disk. When validation read those descriptors from disk, it was the old version (because the batch had not been sent). In some ways, if you want to change a graph of descriptors, it means you have to read them all and then write them all. If we marked them here, as we discussed, we'd still need the changes you suggest. Is that better?

Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @msbutler)


pkg/sql/catalog/descs/validate.go, line 127 at r3 (raw file):

Previously, ajwerner wrote…

The problem runs deeper! If you look at the cases I had to change, i had to change them to defer adding the tables to the collection too early. In those cases, we were adding the descriptors to a batch, and then validation was reading the descriptors from disk. When validation read those descriptors from disk, it was the old version (because the batch had not been sent). In some ways, if you want to change a graph of descriptors, it means you have to read them all and then write them all. If we marked them here, as we discussed, we'd still need the changes you suggest. Is that better?

Ok, so removing this check is the least worst alternative for the time being, in a sense. Let's go with that.

@ajwerner
Copy link
Contributor Author

I'm leaving this be.

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 13, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants