
server,migrations: ensure that no old version can serve traffic #66606

Closed
ajwerner opened this issue Jun 17, 2021 · 11 comments · Fixed by #91071
Assignees
Labels
A-multitenancy Related to multi-tenancy A-server-architecture Relates to the internal APIs and src org for server code branch-master Failures and bugs on the master branch. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-shared-systems Shared Systems Team

Comments

@ajwerner
Contributor

ajwerner commented Jun 17, 2021

Is your feature request related to a problem? Please describe.

In kv nodes we take great care to ensure that old code cannot connect to the cluster before running long-running migrations. For tenants we don't have any such guard rails. Today, with a single instance, this is ~fine (we should add a sanity check that we don’t downgrade post finalization, PR inbound). However, we’ll need to do something more with multiple nodes.

The mechanism by which we do this in the host cluster is via the migration.Cluster interface which has an important method UntilClusterStable. Today this implementation is a no-op for tenant clusters.

// UntilClusterStable is part of the migration.Cluster interface.
//
// Tenant clusters in the current version assume their cluster is stable
// because they presently assume there is at most one running SQL pod. When
// that changes, this logic will need to change.
func (t TenantCluster) UntilClusterStable(ctx context.Context, fn func() error) error {
	return nil
}

Describe the solution you'd like

  1. The very first step is to add a sanity check during tenant server startup to ensure that the version is high enough before serving SQL traffic.

  2. Next we need a protocol to ensure that old instances can never join the cluster. The proposed solution piggy-backs off of 1) and #62818 (server,sql: subsystem to provide unique instance IDs and routing metadata for SQL pods).

If we make sure that every SQL instance checks the version cluster setting after writing its entry into the instances table, and fails to start up if its version is too old, then we know that no instance that is too old will actually start serving traffic. We'll also want to check the version before writing the entry, so that a doomed instance doesn't stall the whole process. Lastly, inside UntilClusterStable, we'll want to make an RPC to each instance and check its version.
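The proposed order of operations could be sketched roughly as follows. This is a simplified, hypothetical model only: an in-memory map stands in for the instances table, plain ints stand in for cluster versions, and the names (`instanceTable`, `startInstance`, `untilClusterStable`) are illustrative rather than real CockroachDB APIs.

```go
package main

import "fmt"

// instanceTable is a stand-in for the SQL instances table,
// mapping instance ID to binary version.
type instanceTable struct {
	instances map[int]int
}

func newInstanceTable() *instanceTable {
	return &instanceTable{instances: map[int]int{}}
}

// startInstance follows the proposed order of operations: check the
// version before writing the entry (so a hopelessly old instance
// doesn't stall the process), write the entry, then check again so a
// concurrent version bump can't race past us.
func (t *instanceTable) startInstance(id, binaryVersion int, clusterVersion func() int) error {
	if binaryVersion < clusterVersion() {
		return fmt.Errorf("instance %d: binary version %d below cluster version %d",
			id, binaryVersion, clusterVersion())
	}
	t.instances[id] = binaryVersion
	// Re-check after writing our entry: if the cluster version advanced
	// in the meantime, remove the entry and fail startup.
	if binaryVersion < clusterVersion() {
		delete(t.instances, id)
		return fmt.Errorf("instance %d: cluster version advanced during startup", id)
	}
	return nil
}

// untilClusterStable models the RPC fan-out: confirm every registered
// instance is at least at the required version before running fn.
func (t *instanceTable) untilClusterStable(required int, fn func() error) error {
	for id, v := range t.instances {
		if v < required {
			return fmt.Errorf("instance %d at version %d < required %d", id, v, required)
		}
	}
	return fn()
}
```

The post-write re-check is what closes the race: an instance that passes the pre-write check but loses a race with a version bump fails startup instead of serving traffic.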

Describe alternatives you've considered

Another approach would be to just write the binary version for each instance into the instances table. I sort of like that approach too: then, instead of making RPCs, we could simply scan the instances table.
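That alternative reduces to a table scan. A minimal sketch, again with hypothetical names and a slice of rows standing in for the instances table:

```go
package main

import "fmt"

// instanceRow is a simplified stand-in for an instances-table row that
// also records the instance's binary version.
type instanceRow struct {
	ID            int
	BinaryVersion int
}

// checkInstancesAtLeast scans the rows and fails if any registered
// instance is below the required version, replacing the RPC fan-out
// with a plain read of the table.
func checkInstancesAtLeast(rows []instanceRow, required int) error {
	for _, r := range rows {
		if r.BinaryVersion < required {
			return fmt.Errorf("instance %d at binary version %d < required %d",
				r.ID, r.BinaryVersion, required)
		}
	}
	return nil
}
```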

Epic: CRDB-10829

Jira issue: CRDB-8144

@ajwerner ajwerner added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 17, 2021
@knz knz added the A-multitenancy Related to multi-tenancy label Jun 18, 2021
@tbg
Member

tbg commented Jun 23, 2021

@lunevalex can you figure out which team owns this kind of issue? There isn't a compelling argument that it should be KV, as this is primarily about SQL tenants.
The answer might be that KV is the only place to put it for lack of any other team being able to provide ownership, but I doubt that.

@ajwerner
Contributor Author

Increasingly I think of this as the server team. It's related to server startup. How do y'all feel about that position?

@lunevalex
Collaborator

@ajwerner I think that is a good suggestion, @knz curious to hear your thoughts?

@knz
Contributor

knz commented Jun 23, 2021

agreed

@knz
Contributor

knz commented Jun 23, 2021

that being said we need a knowledge transfer first (like for the rpc/networking code)

@lunevalex
Collaborator

I believe @ajwerner would be the person with knowledge in this area

@ajwerner
Contributor Author

ajwerner commented Aug 4, 2021

Is related to #68116.

@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Oct 1, 2021
@lunevalex lunevalex added the A-server-architecture Relates to the internal APIs and src org for server code label Oct 1, 2021
@ajwerner ajwerner added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 14, 2021
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Feb 15, 2022
@jtsiros jtsiros added T-shared-systems Shared Systems Team and removed T-server-and-security DB Server & Security labels Feb 17, 2022
@ajstorm
Collaborator

ajstorm commented Mar 14, 2022

Removing the release blocker flag as discussed here: https://cockroachlabs.slack.com/archives/C02HWA24541/p1644865125179499.

@ajstorm ajstorm removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 14, 2022
ajstorm added a commit to ajstorm/cockroach that referenced this issue Nov 4, 2022
This commit adds guardrails around tenant upgrades to ensure that the tenant
can never be upgraded if doing so would leave it inoperable with the existing
host/storage cluster. Before this change, it was possible to upgrade a tenant
at any time. This was unsafe, as the tenant could end up on a higher version
than the host/storage cluster and require upgrade steps that were not yet
present.

To communicate the host/storage cluster version to the secondary tenants, we
leverage the existing cluster settings override mechanism and override the
"version" setting. In practice this doesn't override the version value for the
tenant, because the tenant always has its own version setting, which negates
the override. Instead, it signals to the settingswatcher that the version has
been overridden; upon receiving that signal (or on startup), the
settingswatcher caches the override value as the host cluster version.

The change also requires sending a new override on upgrade, and seeding a new
value in a startup migration.

Closes: cockroachdb#66606

Release note: None
craig bot pushed a commit that referenced this issue Nov 8, 2022
91071: multi-tenant: Add guardrails around tenant upgrade r=ajwerner a=ajstorm

Closes: #66606

Original PR explanation follows for historical purposes (the approach proposed herein was deemed acceptable)
---------------------------------------------------------------------------------------------------------------

This PR contains a proposal for how we may want to solve the problem of guardrails around tenant upgrades. This is a known problem, as described in #66606, with the solution being to block tenant upgrades if they would push the tenant version beyond the host cluster version. The problem, however, is that there's no existing way for a secondary tenant to read the host cluster version. The commit below proposes a solution whereby we plumb the host cluster version into secondary tenants by leveraging the existing cluster settings override mechanism.

When a cluster is initialized or upgraded, it will populate the version information in `system.tenant_settings` using the `version` key. When secondary tenants see this value in their `settingswatcher`, they will not actually override the tenant's cluster version, but will instead cache the value as the host cluster version.
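The special-casing of the `version` key could look roughly like the following. This is a hypothetical, heavily simplified sketch: the watcher is modeled as a map of string settings, versions are compared lexicographically (a simplification; real version comparison is structured), and none of these types are the real CockroachDB `settingswatcher` APIs.

```go
package main

import "sync"

// settingsWatcher models a tenant-side watcher for setting overrides.
type settingsWatcher struct {
	mu                 sync.Mutex
	settings           map[string]string
	hostClusterVersion string
}

func newSettingsWatcher() *settingsWatcher {
	return &settingsWatcher{settings: map[string]string{}}
}

// applyOverride handles an incoming override. For the "version" key the
// tenant's own version setting wins, so instead of applying the override
// we cache it as the host cluster's version.
func (w *settingsWatcher) applyOverride(key, value string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if key == "version" {
		w.hostClusterVersion = value
		return
	}
	w.settings[key] = value
}

// canUpgradeTo reports whether the tenant may finalize an upgrade to v:
// it must not move past the cached host/storage cluster version.
// (Lexicographic comparison is a stand-in for real version comparison.)
func (w *settingsWatcher) canUpgradeTo(v string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.hostClusterVersion != "" && v <= w.hostClusterVersion
}
```

The key design point is that the override is repurposed as a one-way signal: it never changes the tenant's local version, only the ceiling the tenant may upgrade to.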

In a recent multi-tenant conversation, there was some debate as to whether or not this was a reasonable way to inform secondary tenants as to the updated cluster version.  An alternative approach was proposed as follows:

* Teach the tenant `settingswatcher` to also watch the settings table via a rangefeed.
* Filter updates to settings that aren't meant to be visible to the tenant.
* Deliver the updates to the `settingswatcher` via the existing streaming RPC (but with a new field in the TenantSettingsEvent).
* On the settingswatcher side, on receipt of this RPC, cache the host cluster version.

Since I had most of the first approach implemented, I present it here as an option (with accompanying code) and we can debate on this PR review if I should abandon the first approach in favour of the second (or some entirely different approach).

Co-authored-by: Adam Storm <[email protected]>
@craig craig bot closed this as completed in fd42d37 Nov 8, 2022
@ajstorm
Collaborator

ajstorm commented Nov 8, 2022

Reopening. #91071 only addressed part of this issue. We still need to prevent SQL pods that are too old from connecting to up-level host clusters.

@ajstorm ajstorm reopened this Nov 8, 2022
@healthy-pod healthy-pod self-assigned this Jan 12, 2023
@ajstorm
Collaborator

ajstorm commented Mar 7, 2023

@healthy-pod Can you confirm that you've fixed this last part of the issue? If so, can you close this?

@healthy-pod
Contributor

This was completed in #94973. If the binary version of a starting SQL server is too low for the tenant's logical version as read from storage, the server will fail to start.
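The shape of that check can be sketched as a simple comparison at startup. The `Version` type below is a simplified stand-in (the real code uses `roachpb.Version`), and `checkTenantVersion` is a hypothetical name for illustration:

```go
package main

import "fmt"

// Version is a simplified two-component version, standing in for
// roachpb.Version.
type Version struct {
	Major, Minor int
}

// Less reports whether v is strictly older than o.
func (v Version) Less(o Version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// checkTenantVersion fails startup if the pod's binary is older than the
// tenant's logical version as read from storage.
func checkTenantVersion(binary, tenantLogical Version) error {
	if binary.Less(tenantLogical) {
		return fmt.Errorf("binary version %d.%d too old for tenant logical version %d.%d",
			binary.Major, binary.Minor, tenantLogical.Major, tenantLogical.Minor)
	}
	return nil
}
```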
