Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrades: try to repair catalog corruptions during upgrade #105832

Merged
merged 5 commits into from
Jul 5, 2023

Conversation

postamar
Copy link
Contributor

@postamar postamar commented Jun 29, 2023

This change enhances the recently-added upgrade precondition check which
checks for catalog corruptions by trying to repair these automatically
when possible.

Fixes: #104425
Fixes: #85265

Release note (general change): upgrading the cluster version to a new
release will not only check for descriptor- and other catalog
corruptions but will attempt to repair some of them on a best-effort
basis. This should seamlessly get rid of all longstanding descriptor
back-reference corruptions, which typically don't manifest themselves
until a schema change or an upgrade are performed.

@postamar postamar changed the title catalog: add StripDanglingBackReferences method https://github.com/postamar/cockroach/pull/new/fix-104425-repair-catalog Jun 29, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@postamar postamar changed the title https://github.com/postamar/cockroach/pull/new/fix-104425-repair-catalog sql: add crdb_internal.kv_repairable_catalog_corruptions vtable Jun 29, 2023
@postamar postamar force-pushed the fix-104425-repair-catalog branch from 307a1cd to 3731817 Compare June 29, 2023 20:46
@postamar postamar changed the title sql: add crdb_internal.kv_repairable_catalog_corruptions vtable builtins: add crdb_internal.repair_catalog_corruption builtin Jun 29, 2023
@postamar postamar force-pushed the fix-104425-repair-catalog branch from 3731817 to cc16179 Compare June 29, 2023 21:03
@postamar postamar added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Jun 29, 2023
@postamar postamar force-pushed the fix-104425-repair-catalog branch 2 times, most recently from 282d0f6 to c5c4243 Compare July 4, 2023 18:29
@postamar postamar changed the title builtins: add crdb_internal.repair_catalog_corruption builtin upgrade: try to repair catalog corruptions during upgrade Jul 4, 2023
@postamar postamar force-pushed the fix-104425-repair-catalog branch from c5c4243 to 9558d44 Compare July 4, 2023 19:00
@postamar postamar changed the title upgrade: try to repair catalog corruptions during upgrade upgrades: try to repair catalog corruptions during upgrade Jul 4, 2023
@postamar postamar force-pushed the fix-104425-repair-catalog branch from 9558d44 to 3b28c87 Compare July 4, 2023 19:49
@postamar postamar marked this pull request as ready for review July 4, 2023 19:50
@postamar postamar requested a review from a team as a code owner July 4, 2023 19:50
@postamar postamar requested a review from a team July 4, 2023 19:50
@postamar postamar requested review from a team as code owners July 4, 2023 19:50
@postamar postamar requested review from nkodali and msirek and removed request for a team, nkodali and msirek July 4, 2023 19:50
@postamar postamar requested review from dt and knz July 4, 2023 19:51
@postamar postamar requested a review from rafiss July 4, 2023 19:51
@postamar
Copy link
Contributor Author

postamar commented Jul 4, 2023

This is ready for review, I'll reconcile the conflicts around hardcoded OIDs before borsing.

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Great work, this really good!

Reviewed 1 of 1 files at r1, 6 of 6 files at r2, 16 of 16 files at r3, 4 of 4 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @knz, @postamar, and @rafiss)


pkg/sql/sem/builtins/builtins.go line 5226 at r4 (raw file):

SELECT
	CASE corruption
	WHEN 'descriptor'

Maybe document the corruption types in the code, just to make things a bit cleaner.


pkg/upgrade/upgrades/first_upgrade.go line 139 at r5 (raw file):

	} else if hasRows {
		// Attempt to repair catalog corruptions.
		log.Info(ctx, "auto-repairing catalog corruptions detected during upgrade attempt")

From a diagnosability viewpoint, do you think logging which descriptors/corruption type is worth logging? It might be helpful for customer scenarios where we didn't cover something.


pkg/sql/logictest/testdata/logic_test/schema_repair line 284 at r4 (raw file):

SELECT * FROM "".crdb_internal.kv_repairable_catalog_corruptions ORDER BY 1,2,3,4
----
104  105  _corrupt_backref_typ  114    descriptor

Nice!

@postamar postamar force-pushed the fix-104425-repair-catalog branch from 3b28c87 to f32a362 Compare July 5, 2023 18:05
Copy link
Contributor Author

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt, @fqazi, @knz, and @rafiss)


pkg/sql/sem/builtins/builtins.go line 5226 at r4 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Maybe document the corruption types in the code, just to make things a bit cleaner.

Done, in the vtable definition mostly.


pkg/upgrade/upgrades/first_upgrade.go line 139 at r5 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

From a diagnosability viewpoint, do you think logging which descriptors/corruption type is worth logging? It might be helpful for customer scenarios where we didn't cover something.

Good question. I'm worried there's a real possibility of there being too many of them. In fact, I should perform the repair in batches!


pkg/sql/logictest/testdata/logic_test/schema_repair line 284 at r4 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Nice!

Done.

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt, @knz, @postamar, and @rafiss)


pkg/upgrade/upgrades/first_upgrade.go line 139 at r5 (raw file):

Previously, postamar (Marius Posta) wrote…

Good question. I'm worried there's a real possibility of there being too many of them. In fact, I should perform the repair in batches!

Batching here makes tons of sense. I guess we could limit that type of logging into a small number or the First N of these, but I'll leave that up to you.

Marius Posta added 5 commits July 5, 2023 15:38
Previously, function descriptors were not validated when querying this
virtual table. This patch fixes this.

Informs cockroachdb#104418.

Release note: None
Descriptor repairs which can be done in post-deserialization should
always be performed then and not elsewhere, otherwise we jeopardize the
effectiveness of the recently-added catalog corruption repair builtin.

This has no impact otherwise.

Release note: None
This virtual view lists system.descriptor and system.namespace entries
which can be repaired using crdb_internal.unsafe_* and
crdb_internal.repaired_descriptor builtin functions.

Conceptually, this view can be thought of as a projection from
crdb_internal.invalid_objects: anything that shows up in this view
should also show up in that table.

Informs cockroachdb#104425.

Release note: None
This new builtin is meant to be used with the recently-added
crdb_internal.kv_repairable_catalog_corruptions virtual table.

Typically,

    SELECT crdb_internal.repair_catalog_corruption(id, corruption)
    FROM "".crdb_internal.kv_repairable_catalog_corruptions

will ensure that a subsequent query of the vtable will return empty.

Informs cockroachdb#104425.

Release note: None
This change enhances the recently-added upgrade precondition check which
checks for catalog corruptions by trying to repair these automatically
when possible.

Fixes: cockroachdb#104425
Fixes: cockroachdb#85265

Release note (general change): upgrading the cluster version to a new
release will not only check for descriptor- and other catalog
corruptions but will attempt to repair some of them on a best-effort
basis. This should seamlessly get rid of all longstanding descriptor
back-reference corruptions, which typically don't manifest themselves
until a schema change or an upgrade are performed.
@postamar postamar force-pushed the fix-104425-repair-catalog branch from f32a362 to 04bb9e7 Compare July 5, 2023 19:39
@postamar
Copy link
Contributor Author

postamar commented Jul 5, 2023

I'll leave that for a later time, I think.

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 5, 2023

Build succeeded:

@craig craig bot merged commit d7c0e35 into cockroachdb:master Jul 5, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 5, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d8a2197 to blathers/backport-release-23.1-105832: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
3 participants