
server,migrations: ensure that no old version can serve traffic #66606

Closed
ajwerner opened this issue Jun 17, 2021 · 11 comments · Fixed by #91071
Assignees
Labels
A-multitenancy Related to multi-tenancy A-server-architecture Relates to the internal APIs and src org for server code branch-master Failures and bugs on the master branch. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-shared-systems Shared Systems Team

Comments

@ajwerner
Contributor

ajwerner commented Jun 17, 2021

Is your feature request related to a problem? Please describe.

In kv nodes we take great care to ensure that old code cannot connect to the cluster before running long-running migrations. For tenants we don't have any such guard rails. Today, with a single instance, this is ~fine (we should add a sanity check that we don’t downgrade post finalization, PR inbound). However, we’ll need to do something more with multiple nodes.

The mechanism by which we do this in the host cluster is via the migration.Cluster interface which has an important method UntilClusterStable. Today this implementation is a no-op for tenant clusters.

// UntilClusterStable is part of the migration.Cluster interface.
//
// Tenant clusters in the current version assume their cluster is stable
// because they presently assume there is at most one running SQL pod. When
// that changes, this logic will need to change.
func (t TenantCluster) UntilClusterStable(ctx context.Context, fn func() error) error {
	return nil
}

Describe the solution you'd like

  1. The very first step is to add a sanity check during tenant server startup to ensure that the version is high enough before serving SQL traffic.

  2. Next we need a protocol to ensure that old instances can never join the cluster. The proposed solution piggy-backs off of 1) and #62818 (server,sql: subsystem to provide unique instance IDs and routing metadata for SQL pods).

If we make sure that every SQL instance checks the version cluster setting after writing its entry into the instances table, and fails to start up if its version is too old, then we know that no instance that is too old will actually start serving traffic. We'll also want to check the version before writing the entry, so that a doomed instance doesn't stall the whole process. Lastly, inside UntilClusterStable, we'll want to make an RPC to each instance and check its version.
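The proposed order of operations could be sketched roughly as follows. This is a simplified, hypothetical model only: an in-memory map stands in for the instances table, plain ints stand in for cluster versions, and the names (`instanceTable`, `startInstance`, `untilClusterStable`) are illustrative rather than real CockroachDB APIs.

```go
package main

import "fmt"

// instanceTable is a stand-in for the SQL instances table,
// mapping instance ID to binary version.
type instanceTable struct {
	instances map[int]int
}

func newInstanceTable() *instanceTable {
	return &instanceTable{instances: map[int]int{}}
}

// startInstance follows the proposed order of operations: check the
// version before writing the entry (so a hopelessly old instance
// doesn't stall the process), write the entry, then check again so a
// concurrent version bump can't race past us.
func (t *instanceTable) startInstance(id, binaryVersion int, clusterVersion func() int) error {
	if binaryVersion < clusterVersion() {
		return fmt.Errorf("instance %d: binary version %d below cluster version %d",
			id, binaryVersion, clusterVersion())
	}
	t.instances[id] = binaryVersion
	// Re-check after writing our entry: if the cluster version advanced
	// in the meantime, remove the entry and fail startup.
	if binaryVersion < clusterVersion() {
		delete(t.instances, id)
		return fmt.Errorf("instance %d: cluster version advanced during startup", id)
	}
	return nil
}

// untilClusterStable models the RPC fan-out: confirm every registered
// instance is at least at the required version before running fn.
func (t *instanceTable) untilClusterStable(required int, fn func() error) error {
	for id, v := range t.instances {
		if v < required {
			return fmt.Errorf("instance %d at version %d < required %d", id, v, required)
		}
	}
	return fn()
}
```

The post-write re-check is what closes the race: an instance that passes the pre-write check but loses a race with a version bump fails startup instead of serving traffic.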

Describe alternatives you've considered

Another approach would be to just write the binary version for each instance into the instances table. I sort of like that approach too: then, instead of making RPCs, we could simply scan the instances table.
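That alternative reduces to a table scan. A minimal sketch, again with hypothetical names and a slice of rows standing in for the instances table:

```go
package main

import "fmt"

// instanceRow is a simplified stand-in for an instances-table row that
// also records the instance's binary version.
type instanceRow struct {
	ID            int
	BinaryVersion int
}

// checkInstancesAtLeast scans the rows and fails if any registered
// instance is below the required version, replacing the RPC fan-out
// with a plain read of the table.
func checkInstancesAtLeast(rows []instanceRow, required int) error {
	for _, r := range rows {
		if r.BinaryVersion < required {
			return fmt.Errorf("instance %d at binary version %d < required %d",
				r.ID, r.BinaryVersion, required)
		}
	}
	return nil
}
```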

Epic: CRDB-10829

Jira issue: CRDB-8144

@ajwerner ajwerner added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 17, 2021
@knz knz added the A-multitenancy Related to multi-tenancy label Jun 18, 2021
@tbg
Member

tbg commented Jun 23, 2021

@lunevalex can you figure out which team owns this kind of issue? There isn't a compelling argument that it should be KV, as this is primarily about SQL tenants.
The answer might be that KV is the only place to put it for lack of any other team being able to provide ownership, but I doubt that.

@ajwerner
Contributor Author

Increasingly I think of this as the server team. It's related to server startup. How do y'all feel about that position?

@lunevalex
Collaborator

@ajwerner I think that is a good suggestion, @knz curious to hear your thoughts?

@knz
Contributor

knz commented Jun 23, 2021

agreed

@knz
Contributor

knz commented Jun 23, 2021

that being said we need a knowledge transfer first (like for the rpc/networking code)

@lunevalex
Collaborator

I believe @ajwerner would be the person with knowledge in this area

@ajwerner
Contributor Author

ajwerner commented Aug 4, 2021

Is related to #68116.

@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Oct 1, 2021
@lunevalex lunevalex added the A-server-architecture Relates to the internal APIs and src org for server code label Oct 1, 2021
@ajwerner ajwerner added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 14, 2021
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Feb 15, 2022
@jtsiros jtsiros added T-shared-systems Shared Systems Team and removed T-server-and-security DB Server & Security labels Feb 17, 2022
@ajstorm
Collaborator

ajstorm commented Mar 14, 2022

Removing the release blocker flag as discussed here: https://cockroachlabs.slack.com/archives/C02HWA24541/p1644865125179499.

@ajstorm ajstorm removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 14, 2022
ajstorm added a commit to ajstorm/cockroach that referenced this issue Nov 4, 2022
This commit adds guardrails around tenant upgrades to ensure that the tenant
can never be upgraded if doing so would leave it inoperable with the existing
host/storage cluster. Before this change, it was possible to upgrade a tenant
at any time. This was unsafe, as the tenant could end up on a higher version
than the host/storage cluster and require upgrade steps that were not yet
present.

To communicate the host/storage cluster version to the secondary tenants, we
leverage the existing cluster settings override mechanism and override the
"version" setting. In practice this doesn't override the version value for the
tenant, because the tenant always has its own version setting, which negates
the override. Instead, it signals to the settingswatcher that the version has
been overridden; upon receiving that signal (or on startup), the
settingswatcher caches the override value as the host cluster version.

The change also requires sending a new override on upgrade, and seeding a new
value in a startup migration.

Closes: cockroachdb#66606

Release note: None
craig bot pushed a commit that referenced this issue Nov 8, 2022
91071: multi-tenant: Add guardrails around tenant upgrade r=ajwerner a=ajstorm

Closes: #66606

Original PR explanation follows for historical purposes (the approach proposed herein was deemed acceptable)
---------------------------------------------------------------------------------------------------------------

This PR contains a proposal for how we may want to solve the problem of guardrails around tenant upgrades. This is a known problem, as described in #66606, with the solution being to block tenant upgrades if they would push the tenant version beyond the host cluster version. The problem, however, is that there's no existing way for a secondary tenant to read the host cluster version. The commit below proposes a solution whereby we plumb the host cluster version into secondary tenants by leveraging the existing cluster settings override mechanism.

When a cluster is initialized or upgraded, it will populate the version information in `system.tenant_settings` using the `version` key. When secondary tenants see this value in their `settingswatcher`, they will not actually override the tenant's cluster version, but will instead cache the value as the host cluster version.
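The special-casing of the `version` key could look roughly like the following. This is a hypothetical, heavily simplified sketch: the watcher is modeled as a map of string settings, versions are compared lexicographically (a simplification; real version comparison is structured), and none of these types are the real CockroachDB `settingswatcher` APIs.

```go
package main

import "sync"

// settingsWatcher models a tenant-side watcher for setting overrides.
type settingsWatcher struct {
	mu                 sync.Mutex
	settings           map[string]string
	hostClusterVersion string
}

func newSettingsWatcher() *settingsWatcher {
	return &settingsWatcher{settings: map[string]string{}}
}

// applyOverride handles an incoming override. For the "version" key the
// tenant's own version setting wins, so instead of applying the override
// we cache it as the host cluster's version.
func (w *settingsWatcher) applyOverride(key, value string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if key == "version" {
		w.hostClusterVersion = value
		return
	}
	w.settings[key] = value
}

// canUpgradeTo reports whether the tenant may finalize an upgrade to v:
// it must not move past the cached host/storage cluster version.
// (Lexicographic comparison is a stand-in for real version comparison.)
func (w *settingsWatcher) canUpgradeTo(v string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.hostClusterVersion != "" && v <= w.hostClusterVersion
}
```

The key design point is that the override is repurposed as a one-way signal: it never changes the tenant's local version, only the ceiling the tenant may upgrade to.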

In a recent multi-tenant conversation, there was some debate as to whether or not this was a reasonable way to inform secondary tenants as to the updated cluster version.  An alternative approach was proposed as follows:

* Teach the tenant `settingswatcher` to also watch the settings table via a rangefeed.
* Filter updates to settings that aren't meant to be visible to the tenant.
* Deliver the updates to the `settingswatcher` via the existing streaming RPC (but with a new field in the TenantSettingsEvent).
* On the settingswatcher side, on receipt of this RPC, cache the host cluster version.

Since I had most of the first approach implemented, I present it here as an option (with accompanying code) and we can debate on this PR review if I should abandon the first approach in favour of the second (or some entirely different approach).

Co-authored-by: Adam Storm <[email protected]>
@craig craig bot closed this as completed in fd42d37 Nov 8, 2022
@ajstorm
Collaborator

ajstorm commented Nov 8, 2022

Reopening. #91071 only addressed part of this issue. We still need to prevent SQL pods that are too old from connecting to up-level host clusters.

@ajstorm ajstorm reopened this Nov 8, 2022
@healthy-pod healthy-pod self-assigned this Jan 12, 2023
@ajstorm
Collaborator

ajstorm commented Mar 7, 2023

@healthy-pod Can you confirm that you've fixed this last part of the issue? If so, can you close this?

@healthy-pod
Contributor

This was completed in #94973. If the binary version of a starting SQL server is too low for the tenant's logical version as read from storage, the server will fail to start.
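The shape of that check can be sketched as a simple comparison at startup. The `Version` type below is a simplified stand-in (the real code uses `roachpb.Version`), and `checkTenantVersion` is a hypothetical name for illustration:

```go
package main

import "fmt"

// Version is a simplified two-component version, standing in for
// roachpb.Version.
type Version struct {
	Major, Minor int
}

// Less reports whether v is strictly older than o.
func (v Version) Less(o Version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// checkTenantVersion fails startup if the pod's binary is older than the
// tenant's logical version as read from storage.
func checkTenantVersion(binary, tenantLogical Version) error {
	if binary.Less(tenantLogical) {
		return fmt.Errorf("binary version %d.%d too old for tenant logical version %d.%d",
			binary.Major, binary.Minor, tenantLogical.Major, tenantLogical.Minor)
	}
	return nil
}
```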
