Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/schemachanger,backupccl: plot a path for adoption of the declarative schema changer #73071

Closed
ajwerner opened this issue Nov 23, 2021 · 6 comments · Fixed by #76715
Closed
Assignees
Labels
A-disaster-recovery A-schema-changer-impl Related to the implementation of the new schema changer A-schema-transactional C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Nov 23, 2021

Overview

Restores currently jump through some hoops to leverage the existing, imperative schema changer when restoring descriptors which are currently mid-schema change. These same mechanisms won't perfectly work for the declarative schema changer. This issue encompasses both

  • plotting a plan for restoring descriptors which are mid-schema change in the declarative schema changer, and
  • restoring descriptors which preceded the declarative schema changer after we've removed the old code.
Background

In the old world, schema changes operated on a per-descriptor basis and all of their state was implied by the descriptor itself (modulo some description sort of things in the job). This was something of a boon to BACKUP/RESTORE; from a descriptor, the job could be recreated relatively trivially as it didn’t have much state (not that trivially, there’s still a bunch of complexity related to filtering out things that maybe don’t exist).

The above scheme has several known problems which we're trying to fix:

  • Schema changes need to be grouped by transactions and we’d like cross-descriptor rollback or completion.
  • The tight coupling of implicit schema changer state and descriptor logical/physical state has led to much confusion and bad code.
  • The queuing of mutations is devilishly complex.

The declarative schema changer seeks to move away from the descriptor mutation world. Changes will be decomposed into sets of schema elements to be added or removed. These elements will traverse statuses. The planner will formulate a plan based solely on the current set of (element, direction, status) tuples. This is nice because it lets us decompose a bunch of logic in order to build more robust primitives.

A downside is that, from the descriptor alone, we can’t really reconstruct the job state at full fidelity. For a little while, we probably can reconstruct it sufficiently. At the end of the day, we’re not eager to break compatibility with existing descriptors (way too much headache for their users) and we’re trying to not stray too far from the concepts of descriptors which already exist.

Problem statement
  • At some point down the road, we’ll want to restore a descriptor which preceded the declarative schema changer, yet, we’ll also want to have deleted the schema changer code as it existed when that descriptor was created.
  • At a closer point, we’ll want to restore descriptors which are undergoing schema changes using the declarative schema changer.

Describe the solution you'd like

One family of solutions involves moving more of the schema change state back onto the descriptors. One could imagine that we retain an in-sync set of elements and their statuses on the descriptors and in the job. One could go further and just have the job reference the descriptors. This actually wouldn't be that big of a change. There'd still be some work to figure out which elements to trash when restoring. This would be new work, but it wouldn't be too crazy, I don't think. That would give us a path forward in terms of restoring these descriptors in the absence of jobs.

Separately, we could then write migrations which synthesize elements and their statuses from mutations. This does not need to be done until we want to remove the old schema changer.

Additional context

  • By having all of the state in the descriptors instead of in the jobs, we may run afoul of other versioning considerations for full cluster restore.

    • Need to really nail down the model for jobs crossing versions.
  • In the transactional schema change RFC, there’s a notion of a descriptor which is in a state that needs to be rolled back. This is a bit hand-wavy in the RFC.

    • There are tradeoffs here regarding how exactly to atomically add constraints without exposing them to the client before the constraint is committed.
    • In general, there’s a feeling that it’d be okay to ship a version of transactional schema changes which exposed, very briefly, clients to enforcing a constraint which has not yet been committed. The alternative, more complex, approach is to have the execution of the statement preempt the schema change. This would probably be better, but not definitely.
      • “There’s isolation, then there’s isolation”
    • This cleanup differs from what we’d want to do in restore in that it always assumes we want to rollback whereas restore may want to roll forward.

Epic CRDB-2356

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery A-schema-changer-impl Related to the implementation of the new schema changer A-schema-transactional labels Nov 23, 2021
@blathers-crl
Copy link

blathers-crl bot commented Nov 23, 2021

cc @cockroachdb/bulk-io

@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Nov 23, 2021
@ajwerner ajwerner self-assigned this Nov 23, 2021
@postamar
Copy link
Contributor

I've added an epic link in the description, because we'll have to do something about this if we're to use the declarative schema changer in production in 22.1.

@ajwerner
Copy link
Contributor Author

Spoke with @dt about this. More or less we seemed to agree on the suggested proposal. The feeling was that we will be better off duplicating even the metadata of the statements to each descriptor in play. In practice, this looks like:

  • Move all of the elements of a schema change job from the job itself to the descriptors in a new message[s].
  • Additionally add the metadata regarding statements and authorization also to the descriptors.
    • Sure, this means that we're going to write this metadata multiple time, I don't care.
    • It does leave some room for corruption, also don't care.
  • Have the job point to all of the descriptors. It will mean going and fetching all of the descriptors. From a latency perspective, this relates to sql/catalog: provide batched access to descriptors #64388.

This approach has the nice property that we can, given a set of descriptors in a restore, retain the relevant parts of the schema change state machine without needing to involve jobs everywhere. Until we can get out of the game of dealing with descriptors in backups altogether (🤞, the only way we're ever going to make BACKUP/RESTORE actually bulletproof), this seems like the least bad option.

Eventually, when we go to remove the old schema changer, we'll need to do something about upgrading descriptor representations and mapping the old schema changer state into the new one. That, or we retain the code but disallow its use for at least a version. I hear there are upcoming new limitations on the lifetime of support for backups.

Similarly, we'll need to gain some invariants around how old a running job is allowed to be. I'll file a separate issue about that.

@postamar
Copy link
Contributor

This all sounds reasonable to me. I guess this implies we should also replace the state in the SchemaChangerState in the connExecutor with a set of descriptor IDs.

Beyond backups, this does feel conceptually nicer somehow. Effectively persisting the schema changer state in the descriptors will have the nice side-benefit of forcing us to be more careful about backward compatibility and helps us avoid doing dumb things. Having nothing in the new-schema-change job details (not including the descriptor IDs at the job top level) makes that more future-proof too, which honestly is a relief because I'm not sure that that's well tested in mixed-version clusters ("stmt gets executed on new node, job gets adopted by old node" scenarios or vice-versa). This also leverages the possibilities offered by descriptor validation and post-deserialization changes, etc.

@ajwerner
Copy link
Contributor Author

Nice, glad we've found a happy place to land on this. I'm going to assign myself this and take it for December. It'd be cool to have a backup and restore test of running schema changes using the declarative framework before the new year.

@ajwerner ajwerner added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Dec 14, 2021
@blathers-crl
Copy link

blathers-crl bot commented Dec 14, 2021

Hi @ajwerner, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

craig bot pushed a commit that referenced this issue Feb 12, 2022
76116: schemachanger,scpb,catalog: move schema changer state into descriptors r=ajwerner a=ajwerner

This work is in support of #73071. The first two commit breaks dependencies so descpb can import scpb. Then we add the state to the descriptor protos. Finally we adopt for use. This does not yet test or synthesize the job.

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in fff55f7 Feb 18, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery A-schema-changer-impl Related to the implementation of the new schema changer A-schema-transactional C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants