upgrades: assert descriptor repair has correct set of targets #112700

annrpom · 2023-10-19T16:59:37Z

This patch adds a test to assert that during automated repair of
corrupt descriptors, we do not try to repair a descriptor that
is not corrupted.

This includes an addition to the previous check for post-deserialization changes that we had for the upgrade logic. We found that during the automatic repair, the descriptor already gets updated in place during the unsafe_upsert_descriptor (in WriteDescToBatch) process and that the previous check was a bit redundant. Thus, the check benefits from being a bit stricter - short-circuiting when the only post-deserialization change is SetModTimeToMVCCTimestamp (not having to update descriptors if they have already been updated).

While testing, it was discovered that function descriptors (regardless of corruption/repair) had their descriptor version increased due to grantExecuteToPublicOnAllFunctions being called on each function during a cluster upgrade; however, we noticed that the functions we were testing already had execute privileges for public. Thus, a check was added in said logic that ensures functions in this situation (where public already has execute priv. for the func) do not try to grant execute again.

Epic: none
Fixes: #110906

Release note: None

cockroach-teamcity · 2023-10-19T16:59:48Z

This change is

annrpom · 2023-10-19T17:08:33Z

OK I have pushed this to illustrate a confusion: in my TestUpgradeNoCorruption, I have noticed that while debugging, the table descriptor for foo does not mean this condition if !desc.GetPostDeserializationChanges().HasChanges().
However, the test also illustrates that there are no objects in the invalid_objects table.
This leads me to believe we should replace this check with the invalid objects check as initially proposed since said descriptor isn't corrupted.

I wanted to leave this comment just in case my test is just somehow incorrect.

rafiss · 2023-10-19T21:11:07Z

pkg/upgrade/upgrades/first_upgrade.go

@@ -56,6 +56,9 @@ func FirstUpgradeFromRelease(
 	var batch catalog.DescriptorIDSet
 	const batchSize = 1000
 	if err := all.ForEachDescriptor(func(desc catalog.Descriptor) error {
+		if desc.GetName() == "foo" {


so it does seem correct that this line would be reached, but for any descriptor that is not corrupt, the check right after this one should cause the repair to short-circuit. i think that should be good enough -- even the invalid_objects and repairable_catalog_corruptions tables are populated by first scanning all descriptors, then filtering only for the ones that have issues.

Ah - ok so I left this in here (probably should've commented about this earlier) because I only wanted to debug the not corrupt table "foo" & this was the way I saw to do this with skipping all of the system(?) descriptors. However, the check right after this one does not cause the repair to short-circuit :(

So my confusion is: is it my testing (in TestUpgradeNoCorruption) that is causing this issue? I shall make a loom tomorrow and send it to you to be more illustrative :^)

pairing up would be great!

annrpom · 2023-10-25T21:14:52Z

OK! updated tests reflect new discoveries. wish there was a facepalm emote on github.

the precondition check (FirstUpgradeFromReleasePrecondition) is actually what will decide whether or not a descriptor can be repaired (for now, we only repair kv_repairable_catalog_corruptions). tests now reflect that not-corrupt descriptors do not show up in kv_repairable_catalog_corruptions and will thus not get automatically repaired.

here's something i found while writing said test: function descriptors get their version incremented regardless of 'repaired corruption". thus, if we are looking at a repairable function descriptor, we actually end up increasing its version twice. once during the repair in UnsafeUpsertDescriptor and another during upgrades.grantExecuteToPublicOnAllFunctions -> WriteDescToBatch. ~~is this expected? why does this happen for function descriptors only??~~ (yes, and it is due to the upgrade path that grants execute to public for all functions)

rafiss

did we also need that change in first_upgrade.go, where we'd skip descriptors where the only change is SetModTimeToMVCCTimestamp?

annrpom · 2023-10-30T15:41:35Z

Couple of notes:
-> It looks like that check in upgradeDescriptors does make sense. In FirstUpgradeFromRelease, we iterate through all descriptors and put only the ones that have PostDeserializationChanges in a "batch" to upgrade if the batch exceeds a certain amount; otherwise, we have a small enough amount of changed descriptors that we can just upgrade all at once.

@rafiss

did we also need that change in first_upgrade.go, where we'd skip descriptors where the only change is SetModTimeToMVCCTimestamp?

This is also something that I looked at - it seems as if this check was no different than the one we had before (I think the fact that I was incorrectly testing in the same manner as FirstaUpgradeFromRelease lead us to the belief that it would be different). In the new tests, regardless of a corrupt or not-corrupt descriptor, we see that that SetMVCCTimestamp is the only post deserialization change to exist; so, if we still had this check, corrupt descriptors that were repaired will not get upgraded

However, this still brings us back to the first problem we looked at - which is: this upgrade process seems to upgrade descriptors that haven't changed :( Is there any other way we can check other than PostDeserializationChanges? Maybe we can make note of a descriptor popping up in kv_repairable_catalog_preconditions (since it will probably be gone from this table once the precondition repairs it) and then have a membership check as the guard to upgrading?

rafiss · 2023-10-30T15:45:53Z

This is also something that I looked at - it seems as if this check was no different than the one we had before

Which check from before are you referring to?

In the new tests, regardless of a corrupt or not-corrupt descriptor, we see that that SetMVCCTimestamp is the only post deserialization change to exist; so, if we still had this check, corrupt descriptors that were repaired will not get upgraded

That seems off to me. The check for SetModTimeToMVCCTimestamp should be here. But even if we add that, the FirstUpgradeFromReleasePrecondition function should still repair corruptions.

However, this still brings us back to the first problem we looked at - which is: this upgrade process seems to upgrade descriptors that haven't changed

Making the change to check SetModTimeToMVCCTimestamp in FirstUpgradeFromRelease should avoid this.

annrpom · 2023-10-30T16:02:22Z

Which check from before are you referring to?

"No different from the one we had before" is a bad way to put it - I meant "it is still not looking at the right group of descriptors"
Just checking hasChanges was the previous check I was referring to - just checking hasChanges short-circuits for "normal" descriptors, but checking hasChanges or if the only post deserialization change is MVCCTimestamp short-circuits for both "normal" descriptors and repaired/updated descriptors

That seems off to me. The check for SetModTimeToMVCCTimestamp should be here. But even if we add that, the FirstUpgradeFromReleasePrecondition function should still repair corruptions.

Sorry, I think I am using my words too loosely - yes, they would still repair corruptions. I meant updated as in upgradeDescriptors - assuming that the same check that exists in FirstUpgradeFromRelease will be the same check that exists in upgradeDescriptors (I believe that these two checks should remain consistent, but I could be wrong)

I believe that these two checks should remain consistent, but I could be wrong

I think that this belief is what is causing confusion? If we make the change to check SetModTimeToMVCCTimestamp in FirstUpgradeFromRelease, I feel like there is not a point in having it if we do not change this same check in upgradeDescriptors - since upgradeDescriptors will still get called regardless and still do the actual CPut on descriptors that haven't changed if the two checks do not align

+++

UPDATE: for anyone that is reading this/future me - this was an oversight in my reading of the code! I did not see that batch was the only set of descriptors that would have been called for upgradeDescriptors

This patch adds a test to assert that during automated repair of corrupt descriptors, we do not try to repair a descriptor that is not corrupted. Epic: none Fixes: cockroachdb#110906 Release note: None

rafiss

lgtm! nice work

Reviewed 2 of 4 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @annrpom)

annrpom · 2023-11-02T14:35:20Z

TFTR! ('-')7

bors r=rafiss

craig · 2023-11-02T15:36:08Z

Build succeeded:

Bazel Essential CI (Cockroach)

rafiss · 2023-11-02T17:45:43Z

blathers backport 23.2

annrpom requested a review from a team October 19, 2023 17:26

annrpom force-pushed the test-descriptor-repair branch from 05b8453 to 1873b7c Compare October 19, 2023 17:51

rafiss reviewed Oct 19, 2023

View reviewed changes

annrpom force-pushed the test-descriptor-repair branch 2 times, most recently from 2a64760 to 38d4419 Compare October 25, 2023 21:03

annrpom force-pushed the test-descriptor-repair branch 3 times, most recently from db8002d to 62827f3 Compare October 30, 2023 15:22

annrpom marked this pull request as ready for review October 30, 2023 15:22

annrpom requested a review from a team October 30, 2023 15:22

rafiss reviewed Oct 30, 2023

View reviewed changes

annrpom force-pushed the test-descriptor-repair branch from 62827f3 to 8ef1930 Compare October 30, 2023 16:41

rafiss mentioned this pull request Oct 30, 2023

upgrades: auto-repair of catalog corruptions during upgrade may extend cluster finalization indefinitely #110321

Closed

upgrades: assert descriptor repair has correct set of targets

64c1f6c

This patch adds a test to assert that during automated repair of corrupt descriptors, we do not try to repair a descriptor that is not corrupted. Epic: none Fixes: cockroachdb#110906 Release note: None

annrpom force-pushed the test-descriptor-repair branch from 8ef1930 to 64c1f6c Compare November 1, 2023 18:20

annrpom requested a review from rafiss November 1, 2023 18:21

rafiss approved these changes Nov 1, 2023

View reviewed changes

craig bot merged commit 8e059c2 into cockroachdb:master Nov 2, 2023
3 checks passed

annrpom deleted the test-descriptor-repair branch November 2, 2023 15:36

blathers-crl bot mentioned this pull request Nov 2, 2023

release-23.2: upgrades: assert descriptor repair has correct set of targets #113692

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

upgrades: assert descriptor repair has correct set of targets #112700

upgrades: assert descriptor repair has correct set of targets #112700

annrpom commented Oct 19, 2023 •

edited

Loading

cockroach-teamcity commented Oct 19, 2023

annrpom commented Oct 19, 2023 •

edited

Loading

rafiss Oct 19, 2023

annrpom Oct 19, 2023 •

edited

Loading

rafiss Oct 19, 2023

annrpom commented Oct 25, 2023 •

edited

Loading

rafiss left a comment

annrpom commented Oct 30, 2023 •

edited

Loading

rafiss commented Oct 30, 2023

annrpom commented Oct 30, 2023 •

edited

Loading

rafiss left a comment

annrpom commented Nov 2, 2023

craig bot commented Nov 2, 2023

rafiss commented Nov 2, 2023

upgrades: assert descriptor repair has correct set of targets #112700

upgrades: assert descriptor repair has correct set of targets #112700

Conversation

annrpom commented Oct 19, 2023 • edited Loading

cockroach-teamcity commented Oct 19, 2023

annrpom commented Oct 19, 2023 • edited Loading

rafiss Oct 19, 2023

Choose a reason for hiding this comment

annrpom Oct 19, 2023 • edited Loading

Choose a reason for hiding this comment

rafiss Oct 19, 2023

Choose a reason for hiding this comment

annrpom commented Oct 25, 2023 • edited Loading

rafiss left a comment

Choose a reason for hiding this comment

annrpom commented Oct 30, 2023 • edited Loading

rafiss commented Oct 30, 2023

annrpom commented Oct 30, 2023 • edited Loading

rafiss left a comment

Choose a reason for hiding this comment

annrpom commented Nov 2, 2023

craig bot commented Nov 2, 2023

rafiss commented Nov 2, 2023

annrpom commented Oct 19, 2023 •

edited

Loading

annrpom commented Oct 19, 2023 •

edited

Loading

annrpom Oct 19, 2023 •

edited

Loading

annrpom commented Oct 25, 2023 •

edited

Loading

annrpom commented Oct 30, 2023 •

edited

Loading

annrpom commented Oct 30, 2023 •

edited

Loading