-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schemachanger,backupccl: support for backup and restore mid-change #76715
schemachanger,backupccl: support for backup and restore mid-change #76715
Conversation
4137868
to
5380342
Compare
@dt would you accept a method on the |
5380342
to
612d9f6
Compare
Can you expand on motivation? Don't we generally eschew |
Do we? #58164 / #62243 / it seems better to catch them than not and crash the server. These days that's quite the taboo. In the first commit, it turned out that if you pause and resume a backfill in the new schema changer after you've checkpointed the progress of completing everything but before marking the state change, then we'd resume and construct an empty set of spans to do, which hits the below panic (should that really be a panic?). Anyway, it'd be fine if we caught that but is rather bad that it crashes the server. I added the below snippet to the code to prevent such things in the future, but it's a pretty heavyweight pattern to expect folks to remember every time: defer func() {
switch r := recover().(type) {
case nil:
return
case error:
err = errors.Wrapf(r, "failed to %s", op)
default:
err = errors.AssertionFailedf("failed to %s: %v", op, r)
}
}() cockroach/pkg/sql/distsql_physical_planner.go Line 960 in 8ba916b
|
This doesn't feel like it belongs in this change. If this change was hitting panics, it should just check If you want to make a general argument that we're using ctxgroup all over bulk ops and schema changes and that across all these we don't want operations in the workers that panic to kill the whole proc, we should make that argument in a standalone issue or change, right? |
I also did that. Yes, that's definitely a right thing to do, but it seems stupid to not defensively translate the panics, as we do during all query execution and job execution. |
Yes, sorry for using this PR as a conduit for a conversation about a separate change, it was just top-of-mind. |
I'm all for cleaning/improving along the way / when a change highlights something that could be better, but in this case spawning/running separate goroutines that don't have recovers is pervasive throughout bulk and schema change code and I'd rather approach it all at once if we think it should change, not just one-off for this one case. |
#76734 is the issue. |
cea9584
to
bec814d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good, just very minor things that can be dealt with later
Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 1 of 1 files at r3, 18 of 22 files at r4, 1 of 2 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)
pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):
} var ids []descpb.ID switch e := t.Element().(type) {
This stuff does worry me, currently, I think it's okay. But longer-term it feels like something will be missed and cause breakages. Maybe some how have attributes for these in the future
pkg/sql/schemachanger/sctest/cumulative.go, line 454 at r6 (raw file):
WHERE "parentID" = (SELECT id FROM system.namespace WHERE name = $1) )`, dbName) afterRestore := tdb.QueryStr(t, fetchDescriptorStateQuery)
I guess this isn't an issue right now, but any danger from fields that need to be filtered? I guess we would need to regex them out (things like timestamps)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @fqazi, and @postamar)
pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
This stuff does worry me, currently, I think it's okay. But longer-term it feels like something will be missed and cause breakages. Maybe some how have attributes for these in the future
Me too, I'll put a TODO. @postamar reports that these slices are going away.
pkg/sql/schemachanger/sctest/cumulative.go, line 454 at r6 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
I guess this isn't an issue right now, but any danger from fields that need to be filtered? I guess we would need to regex them out (things like timestamps)
this is using the SHOW CREATE
strings to compare the tables as opposed to the descriptor values for exactly that reason. It's not perfect, but it's a pretty good proxy for the actual state of the descriptor.
I'm going to add a call to also validate all of the descriptors in the cluster here and call it a day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @fqazi, and @postamar)
pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):
Previously, ajwerner wrote…
Me too, I'll put a TODO. @postamar reports that these slices are going away.
I'm realizing that this needs to find all of the expressions in the structs too and it needs to rewrite those. Maybe reflection is in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only looked at 4th commit and I only looked closely-ish at the backupccl bits but overall I like it, and really like the move towards calling specific helpers over just tons and tons of inline descriptor work in big restore functions.
"github.com/stretchr/testify/require" | ||
"golang.org/x/sync/errgroup" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2¢ I just use ctxgroup even if I don't need the ctx yet.
return nil | ||
} | ||
|
||
func rewriteSchemaChangerState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted to say rewriteSchemaChangerState should be in an sc package.
I mean, I'm even tempted to say all these existing rewrite methods that dig into catalog elements and replace one ID with another should move to an sc package that restore just calls into, but I understand that might be out of scope here, so I guess this adding one here along side keeps it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I thought about it and a lot of symbols would have to move. I directionally agree, but not for this PR.
pkg/ccl/backupccl/restore_job.go
Outdated
return all.Catalog, nil | ||
} | ||
|
||
// createDeclarativeSchemaChangeJobs is called during the last phase of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a good candidate to be in a sc package. could genericize the comment slightly if it was moved to such, to say "when ingesting external descriptors such as in the last phase of restore"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this out to a new package scbackup
.
bec814d
to
7c78223
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @fqazi, and @postamar)
pkg/ccl/backupccl/restore_planning.go, line 1231 at r6 (raw file):
Previously, ajwerner wrote…
I'm realizing that this needs to find all of the expressions in the structs too and it needs to rewrite those. Maybe reflection is in order.
Okay, the reflection dust has been sprinkled onto this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look at the latest set of commits. Happy with the reflection-based approach!
Reviewed 26 of 26 files at r7, 5 of 5 files at r8, 13 of 13 files at r9, 3 of 3 files at r10, 24 of 24 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @fqazi, and @postamar)
pkg/sql/schemachanger/screl/walk.go, line 24 at r10 (raw file):
// WalkDescIDs calls f for every catid.DescID field in e. func WalkDescIDs(e scpb.Element, f func(id *catid.DescID) error) error {
Love this approach much cleaner!
pkg/sql/schemachanger/screl/walk_test.go, line 31 at r10 (raw file):
// as advertised. func TestWalk(t *testing.T) {
Nit: extra new line
7c78223
to
b634cde
Compare
This reworking is a precursor to adding backup/restore testing. Pause testing is also extremely valuable. It caught a bug when we resume after having finished a backfill but before we marked it as complete. This was bad for two reasons: 1) We weren't catching the panic on those goroutines (lesson learned, again) 2) We weren't detecting the case where we were done. Release note: None
This should really use the namespace element, but I needed this fix to not break some other tests I'm writing, so it's just a hack. Release note: None
This is handy to know what in a proto message is intended to be a sql expression. Release note: None
We'll need this to rewrite these parts in restore. Release note: None
This commit touches the schema changer to encode some more metadata about the state of the schema change into the descriptors in order to better facilitate synthesizing a job at restore time. The major contribution, then, is to support the declarative schema changer in restore. Along the way, it picks up some catalog niceties like batches descriptor retreival and the nstree.Catalog. Finally, the code is tested with a suite of datadriven tests which run schema changes and then take the state and BACKUP/RESTORE it at each of the possible stages (including all of the rollback stages) and makes sure that the right outcome ultimately happens. Note that this is currently just tested for database-level BACKUP/RESTORE and that more testing is planned as follow-up. Release note: None
b634cde
to
c10b5ef
Compare
TFTR! bors r+ |
Build failed (retrying...): |
Build succeeded: |
schemachanger,backupccl: support for backup and restore mid-change
This commit touches the schema changer to encode some more metadata about the
state of the schema change into the descriptors in order to better facilitate
synthesizing a job at restore time.
The major contribution, then, is to support the declarative schema changer in
restore. Along the way, it picks up some catalog niceties like batches
descriptor retreival and the nstree.Catalog.
Finally, the code is tested with a suite of datadriven tests which run schema
changes and then take the state and BACKUP/RESTORE it at each of the possible
stages (including all of the rollback stages) and makes sure that the right
outcome ultimately happens.
Note that this is currently just tested for database-level BACKUP/RESTORE
and that more testing is planned as follow-up.
sql/schemachanger: move where we remove namespace entries to be earlier
This should really use the namespace element, but I needed this fix
to not break some other tests I'm writing.
backupccl: don't back up dropped databases or schemas
Hack to deal with bugs that Aditya is fixing concurrently in #76635.
sql/schemachanger: add more end-to-end testing
This reworking is a precursor to adding backup/restore testing. Pause testing
is also extremely valuable. It caught a bug when we resume after having
finished a backfill but before we marked it as complete. This was bad
for two reasons:
Fixes #73071
Release note: None