Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrades: assert descriptor repair has correct set of targets #112700

Merged
merged 1 commit into from
Nov 2, 2023

Conversation

annrpom
Copy link
Contributor

@annrpom annrpom commented Oct 19, 2023

This patch adds a test to assert that during automated repair of
corrupt descriptors, we do not try to repair a descriptor that
is not corrupted.

This includes an addition to the previous check for post-deserialization changes that we had for the upgrade logic. We found that during the automatic repair, the descriptor already gets updated in place during the unsafe_upsert_descriptor (in WriteDescToBatch) process and that the previous check was a bit redundant. Thus, the check benefits from being a bit stricter - short-circuiting when the only post-deserialization change is SetModTimeToMVCCTimestamp (not having to update descriptors if they have already been updated).

While testing, it was discovered that function descriptors (regardless of corruption/repair) had their descriptor version increased due to grantExecuteToPublicOnAllFunctions being called on each function during a cluster upgrade; however, we noticed that the functions we were testing already had execute privileges for public. Thus, a check was added in said logic that ensures functions in this situation (where public already has execute priv. for the func) do not try to grant execute again.

Epic: none
Fixes: #110906

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@annrpom
Copy link
Contributor Author

annrpom commented Oct 19, 2023

OK I have pushed this to illustrate a confusion: in my TestUpgradeNoCorruption, I have noticed that while debugging, the table descriptor for foo does not mean this condition if !desc.GetPostDeserializationChanges().HasChanges().
However, the test also illustrates that there are no objects in the invalid_objects table.
This leads me to believe we should replace this check with the invalid objects check as initially proposed since said descriptor isn't corrupted.

I wanted to leave this comment just in case my test is just somehow incorrect.

@annrpom annrpom requested a review from a team October 19, 2023 17:26
@annrpom annrpom force-pushed the test-descriptor-repair branch from 05b8453 to 1873b7c Compare October 19, 2023 17:51
@@ -56,6 +56,9 @@ func FirstUpgradeFromRelease(
var batch catalog.DescriptorIDSet
const batchSize = 1000
if err := all.ForEachDescriptor(func(desc catalog.Descriptor) error {
if desc.GetName() == "foo" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it does seem correct that this line would be reached, but for any descriptor that is not corrupt, the check right after this one should cause the repair to short-circuit. i think that should be good enough -- even the invalid_objects and repairable_catalog_corruptions tables are populated by first scanning all descriptors, then filtering only for the ones that have issues.

Copy link
Contributor Author

@annrpom annrpom Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - ok so I left this in here (probably should've commented about this earlier) because I only wanted to debug the not corrupt table "foo" & this was the way I saw to do this with skipping all of the system(?) descriptors. However, the check right after this one does not cause the repair to short-circuit :(

So my confusion is: is it my testing (in TestUpgradeNoCorruption) that is causing this issue? I shall make a loom tomorrow and send it to you to be more illustrative :^)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pairing up would be great!

@annrpom annrpom force-pushed the test-descriptor-repair branch 2 times, most recently from 2a64760 to 38d4419 Compare October 25, 2023 21:03
@annrpom
Copy link
Contributor Author

annrpom commented Oct 25, 2023

OK! updated tests reflect new discoveries. wish there was a facepalm emote on github.

the precondition check (FirstUpgradeFromReleasePrecondition) is actually what will decide whether or not a descriptor can be repaired (for now, we only repair kv_repairable_catalog_corruptions). tests now reflect that not-corrupt descriptors do not show up in kv_repairable_catalog_corruptions and will thus not get automatically repaired.

here's something i found while writing said test: function descriptors get their version incremented regardless of 'repaired corruption". thus, if we are looking at a repairable function descriptor, we actually end up increasing its version twice. once during the repair in UnsafeUpsertDescriptor and another during upgrades.grantExecuteToPublicOnAllFunctions -> WriteDescToBatch. is this expected? why does this happen for function descriptors only?? (yes, and it is due to the upgrade path that grants execute to public for all functions)

@annrpom annrpom force-pushed the test-descriptor-repair branch 3 times, most recently from db8002d to 62827f3 Compare October 30, 2023 15:22
@annrpom annrpom marked this pull request as ready for review October 30, 2023 15:22
@annrpom annrpom requested a review from a team October 30, 2023 15:22
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we also need that change in first_upgrade.go, where we'd skip descriptors where the only change is SetModTimeToMVCCTimestamp?

@annrpom
Copy link
Contributor Author

annrpom commented Oct 30, 2023

Couple of notes:
-> It looks like that check in upgradeDescriptors does make sense. In FirstUpgradeFromRelease, we iterate through all descriptors and put only the ones that have PostDeserializationChanges in a "batch" to upgrade if the batch exceeds a certain amount; otherwise, we have a small enough amount of changed descriptors that we can just upgrade all at once.

@rafiss

did we also need that change in first_upgrade.go, where we'd skip descriptors where the only change is SetModTimeToMVCCTimestamp?

This is also something that I looked at - it seems as if this check was no different than the one we had before (I think the fact that I was incorrectly testing in the same manner as FirstaUpgradeFromRelease lead us to the belief that it would be different). In the new tests, regardless of a corrupt or not-corrupt descriptor, we see that that SetMVCCTimestamp is the only post deserialization change to exist; so, if we still had this check, corrupt descriptors that were repaired will not get upgraded

However, this still brings us back to the first problem we looked at - which is: this upgrade process seems to upgrade descriptors that haven't changed :( Is there any other way we can check other than PostDeserializationChanges? Maybe we can make note of a descriptor popping up in kv_repairable_catalog_preconditions (since it will probably be gone from this table once the precondition repairs it) and then have a membership check as the guard to upgrading?

@rafiss
Copy link
Collaborator

rafiss commented Oct 30, 2023

This is also something that I looked at - it seems as if this check was no different than the one we had before

Which check from before are you referring to?

In the new tests, regardless of a corrupt or not-corrupt descriptor, we see that that SetMVCCTimestamp is the only post deserialization change to exist; so, if we still had this check, corrupt descriptors that were repaired will not get upgraded

That seems off to me. The check for SetModTimeToMVCCTimestamp should be here. But even if we add that, the FirstUpgradeFromReleasePrecondition function should still repair corruptions.

However, this still brings us back to the first problem we looked at - which is: this upgrade process seems to upgrade descriptors that haven't changed

Making the change to check SetModTimeToMVCCTimestamp in FirstUpgradeFromRelease should avoid this.

@annrpom
Copy link
Contributor Author

annrpom commented Oct 30, 2023

Which check from before are you referring to?

"No different from the one we had before" is a bad way to put it - I meant "it is still not looking at the right group of descriptors"
Just checking hasChanges was the previous check I was referring to - just checking hasChanges short-circuits for "normal" descriptors, but checking hasChanges or if the only post deserialization change is MVCCTimestamp short-circuits for both "normal" descriptors and repaired/updated descriptors

That seems off to me. The check for SetModTimeToMVCCTimestamp should be here. But even if we add that, the FirstUpgradeFromReleasePrecondition function should still repair corruptions.

Sorry, I think I am using my words too loosely - yes, they would still repair corruptions. I meant updated as in upgradeDescriptors - assuming that the same check that exists in FirstUpgradeFromRelease will be the same check that exists in upgradeDescriptors (I believe that these two checks should remain consistent, but I could be wrong)

I believe that these two checks should remain consistent, but I could be wrong

I think that this belief is what is causing confusion? If we make the change to check SetModTimeToMVCCTimestamp in FirstUpgradeFromRelease, I feel like there is not a point in having it if we do not change this same check in upgradeDescriptors - since upgradeDescriptors will still get called regardless and still do the actual CPut on descriptors that haven't changed if the two checks do not align

+++

UPDATE: for anyone that is reading this/future me - this was an oversight in my reading of the code! I did not see that batch was the only set of descriptors that would have been called for upgradeDescriptors

This patch adds a test to assert that during automated repair of
corrupt descriptors, we do not try to repair a descriptor that
is not corrupted.

Epic: none
Fixes: cockroachdb#110906

Release note: None
@annrpom annrpom force-pushed the test-descriptor-repair branch from 8ef1930 to 64c1f6c Compare November 1, 2023 18:20
@annrpom annrpom requested a review from rafiss November 1, 2023 18:21
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! nice work

Reviewed 2 of 4 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @annrpom)

@annrpom
Copy link
Contributor Author

annrpom commented Nov 2, 2023

TFTR! ('-')7

bors r=rafiss

@craig
Copy link
Contributor

craig bot commented Nov 2, 2023

Build succeeded:

@craig craig bot merged commit 8e059c2 into cockroachdb:master Nov 2, 2023
3 checks passed
@annrpom annrpom deleted the test-descriptor-repair branch November 2, 2023 15:36
@rafiss
Copy link
Collaborator

rafiss commented Nov 2, 2023

blathers backport 23.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrades: Add sanity check that we only repair descriptors that show up in invalid_objects table.
3 participants