Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schemachanger,backupccl: support for backup and restore mid-change #76715

Merged
merged 5 commits into from
Feb 18, 2022

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Feb 16, 2022

schemachanger,backupccl: support for backup and restore mid-change

This commit touches the schema changer to encode some more metadata about the
state of the schema change into the descriptors in order to better facilitate
synthesizing a job at restore time.

The major contribution, then, is to support the declarative schema changer in
restore. Along the way, it picks up some catalog niceties like batches
descriptor retreival and the nstree.Catalog.

Finally, the code is tested with a suite of datadriven tests which run schema
changes and then take the state and BACKUP/RESTORE it at each of the possible
stages (including all of the rollback stages) and makes sure that the right
outcome ultimately happens.

Note that this is currently just tested for database-level BACKUP/RESTORE
and that more testing is planned as follow-up.

sql/schemachanger: move where we remove namespace entries to be earlier

This should really use the namespace element, but I needed this fix
to not break some other tests I'm writing.

backupccl: don't back up dropped databases or schemas

Hack to deal with bugs that Aditya is fixing concurrently in #76635.

sql/schemachanger: add more end-to-end testing

This reworking is a precursor to adding backup/restore testing. Pause testing
is also extremely valuable. It caught a bug when we resume after having
finished a backfill but before we marked it as complete. This was bad
for two reasons:

  1. We weren't catching the panic on those goroutines (lesson learned, again)
  2. We weren't detecting the case where we were done.

Fixes #73071

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch from 4137868 to 5380342 Compare February 17, 2022 14:01
@ajwerner ajwerner changed the title Ajwerner/more end to end tests schemachanger,backupccl: support for backup and restore mid-change Feb 17, 2022
@ajwerner ajwerner marked this pull request as ready for review February 17, 2022 14:04
@ajwerner ajwerner requested review from a team and dt and removed request for a team February 17, 2022 14:04
@ajwerner
Copy link
Contributor Author

@dt would you accept a method on the ctxgroup which takes an op string and does a defer ... recover to catch panics?

@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch from 5380342 to 612d9f6 Compare February 17, 2022 14:33
@dt
Copy link
Member

dt commented Feb 17, 2022

@dt would you accept a method on the ctxgroup which takes an op string and does a defer ... recover to catch panics?

Can you expand on motivation? Don't we generally eschew recover?

@ajwerner
Copy link
Contributor Author

@dt would you accept a method on the ctxgroup which takes an op string and does a defer ... recover to catch panics?

Can you expand on motivation? Don't we generally eschew recover?

Do we? #58164 / #62243 / it seems better to catch them than not and crash the server. These days that's quite the taboo.

In the first commit, it turned out that if you pause and resume a backfill in the new schema changer after you've checkpointed the progress of completing everything but before marking the state change, then we'd resume and construct an empty set of spans to do, which hits the below panic (should that really be a panic?). Anyway, it'd be fine if we caught that but is rather bad that it crashes the server. I added the below snippet to the code to prevent such things in the future, but it's a pretty heavyweight pattern to expect folks to remember every time:

			defer func() {
				switch r := recover().(type) {
				case nil:
					return
				case error:
					err = errors.Wrapf(r, "failed to %s", op)
				default:
					err = errors.AssertionFailedf("failed to %s: %v", op, r)
				}
			}()

@dt
Copy link
Member

dt commented Feb 17, 2022

catch them than not and crash the server

This doesn't feel like it belongs in this change. If this change was hitting panics, it should just check len(spans) != 0 before calling partitionSpans and avoid the panic, rather than hit and recover it, right?

If you want to make a general argument that we're using ctxgroup all over bulk ops and schema changes and that across all these we don't want operations in the workers that panic to kill the whole proc, we should make that argument in a standalone issue or change, right?

@ajwerner
Copy link
Contributor Author

This doesn't feel like it belongs in this change. If this change was hitting panics, it should just check len(spans) != 0 before calling partitionSpans and avoid the panic, rather than hit and recover it, right?

I also did that. Yes, that's definitely a right thing to do, but it seems stupid to not defensively translate the panics, as we do during all query execution and job execution.

@ajwerner
Copy link
Contributor Author

If you want to make a general argument that we're using ctxgroup all over bulk ops and schema changes and that across all these we don't want operations in the workers that panic to kill the whole proc, we should make that argument in a standalone issue or change, right?

Yes, sorry for using this PR as a conduit for a conversation about a separate change, it was just top-of-mind.

@dt
Copy link
Member

dt commented Feb 17, 2022

I'm all for cleaning/improving along the way / when a change highlights something that could be better, but in this case spawning/running separate goroutines that don't have recovers is pervasive throughout bulk and schema change code and I'd rather approach it all at once if we think it should change, not just one-off for this one case.

@ajwerner
Copy link
Contributor Author

#76734 is the issue.

@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch 2 times, most recently from cea9584 to bec814d Compare February 17, 2022 18:14
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good, just very minor things that can be dealt with later

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 1 of 1 files at r3, 18 of 22 files at r4, 1 of 2 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)


pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):

		}
		var ids []descpb.ID
		switch e := t.Element().(type) {

This stuff does worry me, currently, I think it's okay. But longer-term it feels like something will be missed and cause breakages. Maybe some how have attributes for these in the future


pkg/sql/schemachanger/sctest/cumulative.go, line 454 at r6 (raw file):

             WHERE "parentID" = (SELECT id FROM system.namespace WHERE name = $1)
        )`, dbName)
			afterRestore := tdb.QueryStr(t, fetchDescriptorStateQuery)

I guess this isn't an issue right now, but any danger from fields that need to be filtered? I guess we would need to regex them out (things like timestamps)

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @fqazi, and @postamar)


pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

This stuff does worry me, currently, I think it's okay. But longer-term it feels like something will be missed and cause breakages. Maybe some how have attributes for these in the future

Me too, I'll put a TODO. @postamar reports that these slices are going away.


pkg/sql/schemachanger/sctest/cumulative.go, line 454 at r6 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

I guess this isn't an issue right now, but any danger from fields that need to be filtered? I guess we would need to regex them out (things like timestamps)

this is using the SHOW CREATE strings to compare the tables as opposed to the descriptor values for exactly that reason. It's not perfect, but it's a pretty good proxy for the actual state of the descriptor.

I'm going to add a call to also validate all of the descriptors in the cluster here and call it a day.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @fqazi, and @postamar)


pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):

Previously, ajwerner wrote…

Me too, I'll put a TODO. @postamar reports that these slices are going away.

I'm realizing that this needs to find all of the expressions in the structs too and it needs to rewrite those. Maybe reflection is in order.

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only looked at 4th commit and I only looked closely-ish at the backupccl bits but overall I like it, and really like the move towards calling specific helpers over just tons and tons of inline descriptor work in big restore functions.

pkg/sql/schemachanger/sctest/cumulative.go Show resolved Hide resolved
"github.com/stretchr/testify/require"
"golang.org/x/sync/errgroup"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2¢ I just use ctxgroup even if I don't need the ctx yet.

return nil
}

func rewriteSchemaChangerState(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to say rewriteSchemaChangerState should be in an sc package.

I mean, I'm even tempted to say all these existing rewrite methods that dig into catalog elements and replace one ID with another should move to an sc package that restore just calls into, but I understand that might be out of scope here, so I guess this adding one here along side keeps it consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I thought about it and a lot of symbols would have to move. I directionally agree, but not for this PR.

return all.Catalog, nil
}

// createDeclarativeSchemaChangeJobs is called during the last phase of a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a good candidate to be in a sc package. could genericize the comment slightly if it was moved to such, to say "when ingesting external descriptors such as in the last phase of restore"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this out to a new package scbackup.

@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch from bec814d to 7c78223 Compare February 18, 2022 03:55
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @fqazi, and @postamar)


pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):

Previously, ajwerner wrote…

I'm realizing that this needs to find all of the expressions in the structs too and it needs to rewrite those. Maybe reflection is in order.

Okay, the reflection dust has been sprinkled onto this.

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look at the latest set of commits. Happy with the reflection-based approach! :lgtm:

Reviewed 26 of 26 files at r7, 5 of 5 files at r8, 13 of 13 files at r9, 3 of 3 files at r10, 24 of 24 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @fqazi, and @postamar)


pkg/sql/schemachanger/screl/walk.go, line 24 at r10 (raw file):

// WalkDescIDs calls f for every catid.DescID field in e.
func WalkDescIDs(e scpb.Element, f func(id *catid.DescID) error) error {

Love this approach much cleaner!


pkg/sql/schemachanger/screl/walk_test.go, line 31 at r10 (raw file):

// as advertised.
func TestWalk(t *testing.T) {

Nit: extra new line

@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch from 7c78223 to b634cde Compare February 18, 2022 04:50
ajwerner and others added 2 commits February 18, 2022 00:14
This reworking is a precursor to adding backup/restore testing. Pause testing
is also extremely valuable. It caught a bug when we resume after having
finished a backfill but before we marked it as complete. This was bad
for two reasons:

1) We weren't catching the panic on those goroutines (lesson learned, again)
2) We weren't detecting the case where we were done.

Release note: None
This should really use the namespace element, but I needed this fix
to not break some other tests I'm writing, so it's just a hack.

Release note: None
This is handy to know what in a proto message is intended to be a sql
expression.

Release note: None
We'll need this to rewrite these parts in restore.

Release note: None
This commit touches the schema changer to encode some more metadata about the
state of the schema change into the descriptors in order to better facilitate
synthesizing a job at restore time.

The major contribution, then, is to support the declarative schema changer in
restore. Along the way, it picks up some catalog niceties like batches
descriptor retreival and the nstree.Catalog.

Finally, the code is tested with a suite of datadriven tests which run schema
changes and then take the state and BACKUP/RESTORE it at each of the possible
stages (including all of the rollback stages) and makes sure that the right
outcome ultimately happens.

Note that this is currently just tested for database-level BACKUP/RESTORE
and that more testing is planned as follow-up.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/more-end-to-end-tests branch from b634cde to c10b5ef Compare February 18, 2022 05:15
@ajwerner
Copy link
Contributor Author

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 18, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 18, 2022

Build succeeded:

@craig craig bot merged commit fff55f7 into cockroachdb:master Feb 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/schemachanger,backupccl: plot a path for adoption of the declarative schema changer
4 participants