server,migrations: ensure that no old version can serve traffic #66606
@lunevalex can you figure out which team owns this kind of issue? There isn't a compelling argument that it should be KV, as this is primarily about SQL tenants.

Increasingly I think of this as the server team. It's related to server startup. How do y'all feel about that position?

agreed

that being said we need a knowledge transfer first (like for the rpc/networking code)

I believe @ajwerner would be the person with knowledge in this area.

Is related to #68116.

Removing the release blocker flag as discussed here: https://cockroachlabs.slack.com/archives/C02HWA24541/p1644865125179499.
91071: multi-tenant: Add guardrails around tenant upgrade r=ajwerner a=ajstorm

This commit adds guardrails around tenant upgrades to ensure that the tenant can never be upgraded if it means that it will no longer be operable with the existing host/storage cluster. Before this change, it was possible to upgrade a tenant at any time. This was unsafe as it implied that the tenant could end up on a higher version than the host/storage cluster and may require upgrades that were not yet present.

To communicate the host/storage cluster version to the secondary tenants we leverage the existing cluster settings override mechanism and override the "version" setting. In practice this doesn't override the version value for the tenant, because the tenant will always have a version setting, thus negating the override. Instead, it communicates to the settingswatcher that the version has been overridden, and upon getting that communication (or on startup) the settingswatcher caches the override value as the host cluster version. The change also requires sending a new override on upgrade, and seeding a new value in a startup migration.

Release note: None

Closes: #66606

---

Original PR explanation follows for historical purposes (the approach proposed herein was deemed acceptable):

This PR contains a proposal for how we may want to solve the problem of guardrails around tenant upgrades. This is a known problem as described in #66606, with the solution being to block tenant upgrades if they will push the tenant version beyond the host cluster version. The problem, however, is that there's no existing way for a secondary tenant to read the host cluster version.

The commit below proposes a solution whereby we plumb the host cluster version into secondary tenants by leveraging the existing cluster settings override mechanism. When a cluster is initialized or upgraded, it will populate the version information in `system.tenant_settings` using the `version` key. When secondary tenants see this value in their `settingswatcher` they will not actually override the tenant's cluster version, but instead will cache the value as the host cluster version.

In a recent multi-tenant conversation, there was some debate as to whether or not this was a reasonable way to inform secondary tenants as to the updated cluster version. An alternative approach was proposed as follows:

* Teach the tenant `settingswatcher` to also watch the settings table via a rangefeed.
* Filter updates to settings that aren't meant to be visible to the tenant.
* Deliver the updates to the `settingswatcher` via the existing streaming RPC (but with a new field in the TenantSettingsEvent).
* On the settingswatcher side, on receipt of this RPC, cache the host cluster version.

Since I had most of the first approach implemented, I present it here as an option (with accompanying code), and we can debate on this PR review whether I should abandon the first approach in favour of the second (or some entirely different approach).

Co-authored-by: Adam Storm <[email protected]>
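For concreteness, here is a minimal Go sketch of the caching behaviour described above. The names (`SettingsWatcher`, `handleOverride`, `checkTenantUpgrade`) and the `Version` type are invented for illustration and do not match the actual CockroachDB APIs; the point is only that an override of the "version" key is cached as the host cluster version rather than applied, and upgrades past it are refused.

```go
package main

import (
	"fmt"
	"sync"
)

// Version is a simplified stand-in for a cluster version. In
// CockroachDB this would be roachpb.Version; an illustrative type
// keeps the sketch self-contained.
type Version struct{ Major, Minor int32 }

// Less reports whether v orders before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// SettingsWatcher caches the host cluster version that arrives as an
// override of the "version" setting. The override is never applied to
// the tenant's own version setting (the tenant always has one, which
// negates the override); it only records what the host is running.
type SettingsWatcher struct {
	mu          sync.Mutex
	hostVersion Version
	hasHost     bool
}

// handleOverride is invoked for each override row seen on startup or
// from the rangefeed. The "version" key is special-cased: it is
// cached rather than applied.
func (w *SettingsWatcher) handleOverride(key string, v Version) {
	if key != "version" {
		return // other overrides would be applied normally (elided)
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	w.hostVersion, w.hasHost = v, true
}

// checkTenantUpgrade refuses an upgrade that would push the tenant
// past the host/storage cluster.
func (w *SettingsWatcher) checkTenantUpgrade(target Version) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.hasHost {
		return fmt.Errorf("host cluster version unknown; refusing upgrade")
	}
	if w.hostVersion.Less(target) {
		return fmt.Errorf("cannot upgrade tenant to %+v: host cluster is at %+v",
			target, w.hostVersion)
	}
	return nil
}

func main() {
	w := &SettingsWatcher{}
	w.handleOverride("version", Version{Major: 22, Minor: 2})
	fmt.Println(w.checkTenantUpgrade(Version{Major: 23, Minor: 1})) // refused
	fmt.Println(w.checkTenantUpgrade(Version{Major: 22, Minor: 2})) // <nil>
}
```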
Reopening. #91071 only addressed part of this issue. We still need to prevent SQL pods that are too old from connecting to up-level host clusters.

@healthy-pod Can you confirm that you've fixed this last part of the issue? If so, can you close this?
This was completed in #94973. If the binary version of a pre-starting SQL server is too low for the tenant logical version as read from storage, it will fail to start. |
Is your feature request related to a problem? Please describe.
In kv nodes we take great care to ensure that old code cannot connect to the cluster before running long-running migrations. For tenants we don't have any such guard rails. Today, with a single instance, this is ~fine (we should add a sanity check that we don’t downgrade post finalization, PR inbound). However, we’ll need to do something more with multiple nodes.
The mechanism by which we do this in the host cluster is via the `migration.Cluster` interface, which has an important method, `UntilClusterStable`. Today this implementation is a no-op for tenant clusters (see `cockroach/pkg/migration/migrationcluster/tenant_cluster.go`, lines 142 to 149 at 4758a22).
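For context, a trimmed Go sketch of the interface's shape; this is illustrative only, as the real `migration.Cluster` in `pkg/migration/migrationcluster` has more methods and richer signatures than shown here.

```go
package migrationcluster

import "context"

// Cluster abstracts the set of servers a long-running migration must
// reach before it can be considered complete.
type Cluster interface {
	// UntilClusterStable runs fn repeatedly until the cluster
	// membership observed before and after an invocation is
	// unchanged, which guarantees fn reached every member.
	UntilClusterStable(ctx context.Context, fn func() error) error
}

// TenantCluster is the tenant-side implementation. Lacking any view
// of its sibling sql instances, its UntilClusterStable is effectively
// a no-op today: it runs fn once and cannot guarantee stability.
type TenantCluster struct{}

func (t *TenantCluster) UntilClusterStable(ctx context.Context, fn func() error) error {
	return fn() // no membership to stabilize against
}
```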
Describe the solution you'd like
The very first step is to add sanity checking during tenant server startup to ensure that the version is high enough before serving sql traffic.
Next we need to add a protocol to ensure that old instances will never join the cluster. The proposed solution here piggy-backs off of 1) and off #62818 (server,sql: subsystem to provide unique instance IDs and routing metadata for sql pods).
If we were to make sure that all sql instances check the cluster setting for their version after they write their entry into the instances table, and fail to start up if their version is too old, then we'd know that no instance that is too old will actually start serving traffic. Now, we'll also want to check the version before writing our entry, so as not to starve the whole process. Lastly, inside `UntilClusterStable`, we'll want to make an RPC to each instance and check its version. A sketch of this check-write-recheck flow follows.
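A minimal sketch of the proposed startup handshake, with invented `InstanceRegistry` and `VersionReader` interfaces standing in for the real instances-table and settings plumbing (none of these names exist in the codebase):

```go
package tenantstartup

import (
	"context"
	"errors"
	"fmt"
)

// Version is a simplified stand-in for a cluster version.
type Version struct{ Major, Minor int32 }

func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// InstanceRegistry abstracts writes to the instances table.
type InstanceRegistry interface {
	WriteEntry(ctx context.Context, id int) error
	DeleteEntry(ctx context.Context, id int) error
}

// VersionReader reads the tenant's active cluster version setting.
type VersionReader interface {
	ActiveVersion(ctx context.Context) (Version, error)
}

// startSQLInstance runs the check-write-recheck handshake: an
// instance whose binary cannot support the active cluster version
// must never begin serving sql traffic.
func startSQLInstance(ctx context.Context, id int, binaryMax Version,
	reg InstanceRegistry, vr VersionReader) error {
	// 1. Pre-check, so a hopelessly old binary never writes an entry
	//    that would starve migrations waiting on cluster stability.
	v, err := vr.ActiveVersion(ctx)
	if err != nil {
		return err
	}
	if binaryMax.Less(v) {
		return fmt.Errorf("binary supports up to %+v; cluster is at %+v", binaryMax, v)
	}
	// 2. Publish our entry in the instances table.
	if err := reg.WriteEntry(ctx, id); err != nil {
		return err
	}
	// 3. Re-check after the write: a migration bumping the version
	//    concurrently will now either see our entry or be seen by us.
	if v, err = vr.ActiveVersion(ctx); err != nil {
		return err
	}
	if binaryMax.Less(v) {
		_ = reg.DeleteEntry(ctx, id) // best-effort cleanup
		return errors.New("cluster version advanced past this binary during startup")
	}
	return nil // safe to serve sql traffic
}
```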
Describe alternatives you've considered

Another approach would be to actually just write the binary version for each instance into the instances table. I sort of like that approach too. Then instead of making RPCs, we could just scan the instances table.
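A rough sketch of what that alternative could look like, assuming a hypothetical binary-version column on the instances table (and reusing the illustrative `Version` type from the sketch above):

```go
package tenantstartup

import (
	"context"
	"fmt"
)

// InstanceRow mirrors a hypothetical instances-table row carrying the
// binary version alongside the instance ID.
type InstanceRow struct {
	ID            int
	BinaryVersion Version
}

// checkAllInstancesAtLeast fails if any registered instance runs a
// binary below min, replacing the per-instance RPC fan-out with a
// scan of the instances table.
func checkAllInstancesAtLeast(ctx context.Context,
	scan func(context.Context) ([]InstanceRow, error), min Version) error {
	rows, err := scan(ctx)
	if err != nil {
		return err
	}
	for _, r := range rows {
		if r.BinaryVersion.Less(min) {
			return fmt.Errorf("instance %d at binary %+v, below required %+v",
				r.ID, r.BinaryVersion, min)
		}
	}
	return nil
}
```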
Epic: CRDB-10829
Jira issue: CRDB-8144