diff --git a/docs/RFCS/20211106_multitenant_cluster_settings.md b/docs/RFCS/20211106_multitenant_cluster_settings.md index e04af7393a95..6d666e745b19 100644 --- a/docs/RFCS/20211106_multitenant_cluster_settings.md +++ b/docs/RFCS/20211106_multitenant_cluster_settings.md @@ -1,9 +1,10 @@ - Feature Name: Multi-tenant cluster settings -- Status: accepted +- Status: completed - Start Date: 2021-11-06 -- Authors: Radu Berinde -- RFC PR: #73349 -- Cockroach Issue: #77935 +- Edited: 2023-09-27 +- Authors: Radu Berinde, knz, ssd +- RFC PR: [#85970](https://github.com/cockroachdb/cockroach/pull/85970), previously [#73349](https://github.com/cockroachdb/cockroach/pull/73349) +- Cockroach Issue: [#77935](https://github.com/cockroachdb/cockroach/issue/77935), [#85729](https://github.com/cockroachdb/cockroach/issues/85729) # Summary @@ -18,131 +19,165 @@ them apply exclusively to the KV subsystem; some apply only to the SQL layer. Yet others are harder to classify - for example, they may apply to an aspect of the KV subsystem, but the SQL layer also needs to interact with the setting. -Currently all cluster settings are treated homogeneously; their current values -are stored in the `system.settings` table. +Currently all cluster settings are treated homogeneously; their +current values are stored in the `system.settings` table of the system +tenant, at the level of the storage cluster. -In a multi-tenant deployment, the KV and SQL layers are separated. KV is handled -by a single shared host cluster; in contrast, each tenant runs its own separate -instance of the SQL layer, across multiple SQL pods (that form the tenant -"cluster"). +With cluster virtualization, the KV/storage and SQL layers are +separated. For example, KV is handled by a single shared storage +cluster; in contrast, each virtual cluster runs its own separate +instance of the SQL layer, across multiple SQL pods (that form the +"logical cluster"). -Currently each tenant has its own separate instance of all cluster settings (and -its `system.settings` table). Some settings are designated as `SystemOnly` to -indicate that they are only applicable to the system tenant (these settings are -not expected to be consulted by the tenant code). Tenants can freely change all -other settings, but only those that affect the SQL code run by the tenant will -make any difference. +As of this writing (2021) each virtual cluster has its own separate +instance of all cluster settings (and its own `system.settings` +table). Some settings are designated as `SystemOnly` to indicate that +they are only applicable to the storage layer (these settings are not +expected to be consulted by virtual cluster servers). Virtual clusters +can freely change all other settings, but only those that affect the +SQL code run inside the virtual cluster will make any difference. Beyond the obvious usability issues, there are important functional gaps: - - we need settings that can be read by the tenant process but which cannot be + - we need settings that can be read by VC server processes but which cannot be modified by the end-user. For example: controls for the RU accounting subsystem. - - in certain cases tenant code may need to consult values for cluster settings - that apply to the host cluster: for example + - in certain cases VC code may need to consult values for cluster settings + that apply to the storage cluster: for example `kv.closed_timestamp.follower_reads.enabled` applies to the KV subsystem but is read by the SQL code when serving queries. -### Note on SQL settings - -Many SQL features are controlled using a session setting / cluster setting pair. -The cluster setting is of the form `sql.defaults.*` and contains the default -value for the session setting. In a separate project, these cluster settings are -being deprecated in favor of database/role defaults (ALTER ROLE statement). This -doesn't affect the present proposal, other than to note that there will be much -fewer cluster settings that need to be controlled by the tenant. - # Technical design We propose splitting the cluster settings into three *classes*: -1. System only (`system`) +1. System only (`system-only`) - Settings associated with the host cluster, only usable by the system tenant. - These settings are not visible at all from other tenants. Settings code - prevents use of values for these settings from a tenant process. + Settings associated with the storage layer, only usable by the + system tenant. These settings are not visible at all from virtual + clusters. Settings code prevents use of values for these settings + from a VC server process. Example: `kv.allocator.qps_rebalance_threshold`. - -2. Tenant read-only (`tenant-ro`) - These settings are visible from non-system tenants but the tenants cannot - modify the values. +2. System visible `system-visible` (previously: "Tenant read-only `system-visible`") + + These settings are visible from virtual clusters but the virtual + clusters cannot modify the values. + + The observed value of settings in this class is: + + - by default, the value held for the setting in the system tenant + (the storage cluster's value in the system tenant's + `system.settings`). + - New SQL syntax allows the system tenant to set the value for a + specific tenant; this results in the tenant (asynchronously) + getting the updated value. (i.e. the value for one tenant can be + overridden away from the default) + + Examples: - New SQL syntax allows the system tenant to set the value for a specific - tenant; this results in the tenant (asynchronously) getting the updated - value. The system tenant can also set an "all tenants" default value with a - single command; the value applies to all tenants that don't have their own - specific value, including future tenants. + - Settings that affect the KV replication layer, should not be + writable by tenants, but which benefit the SQL layer in tenants: + `kv.raft.command.max_size`. - Examples: `kv.bulk_ingest.batch_size`, `tenant_cost_control.cpu_usage_allowance`. + - Settings that benefit from being overridden per tenant, but where + inheriting the system tenant's value when not overridden is OK: + `kv.bulk_ingest.batch_size`, `tenant_cpu_usage_allowance`. -3. Tenant writable (`tenant-rw`) +3. Application level (`application`, previously: "Tenant writable `tenant-rw`) - These settings are per tenant and can be modified by the tenant (as well as - the system tenant as above). The system tenant can override these settings - with the same syntax as above (effectively converting specific settings into - `tenant-ro`). + These settings are contained by each virtual cluster and can be + modified by the virtual cluster. They can also be overridden from + the system tenant using the same override mechanism as above. Example: `sql.notices.enabled`. +The difference between the three classes, and with/without override, is as follows: + +| Behavior | System only | System visible | Application writable | +|--------------------------------------------------------------------------------|----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|----------------------------------------------------------------| +| Value lookup order | N/A on virtual clusters; in system tenant, 1) local `settings` 2) compile-time default | 1) per-VC override 2) system tenant `settings` 3) compile-time default | 1) per-VC override 2) local `settings` 2) compile-time default | +| Can run (RE)SET CLUSTER SETTING in system tenant | yes | yes | yes | +| Can run (RE)SET CLUSTER SETTING in virtual cluster | no | no | yes | +| Can set virtual cluster override from system tenant | no | yes | yes | +| Value in current VC's `system.settings` is used as configuration | only in system tenant | no (local value always ignored) | yes, but only if there's no override | +| Default value when the current VC's `system.settings` does not contain a value | compile-time default | per-VC override if any, otherwise system tenant value, otherwise compile-time default | per-VC override if any, otherwise compile-time default | + +In effect, this means there's two ways to create a "read-only" setting +in virtual clusters: + +- using a "System visible" setting. In that case, the value is taken + from the system tenant and *shared across all tenants*. +- using an "Application writable" setting, and adding an override in + the system tenant. In that case, the value is taken from the + override and *can be specialized per virtual cluster*. + +When should one choose one over the other? The determination should be +done based on whether the configuration is system-wide or can be +meaningfully different per virtual cluster. + #### A note on the threat model -The described restrictions assume that the SQL tenant process is not -compromised. There is no way to prevent a compromised process from changing its -own view of the cluster settings. However, even a compromised process should -never be able to learn the values for the `System` settings. It's also worth -considering how a compromised tenant process can influence future uncompromised -processes. +The described restrictions assume that each virtual cluster server +process is not compromised. There is no way to prevent a compromised +process from changing its own view of the cluster settings. However, +even a compromised process should never be able to learn the values +for the "System only" settings or modify settings for other virtual +cluster. It's also worth considering how a compromised VC server +process can influence future uncompromised processes. ### SQL changes -New statements for the system tenant only (which concern only `tenant-ro` and -`tenant-rw` settings): - - `ALTER TENANT SET CLUSTER SETTING = ` - - Sets the value seen by tenant. For `tenant-rw`, this value will override - any setting from the tenant's side, until the cluster setting is reset. +New statements for the system tenant only: + + - `ALTER VIRTUAL CLUSTER SET CLUSTER SETTING = ` + - Sets the value seen by a VC. For `application`, this value will override + any setting from the VC's side, until the cluster setting is reset. - - `ALTER TENANT ALL SET CLUSTER SETTING = ` - - Sets the value seen by all non-system tenants, except those that have a - specific value for that tenant (set with `ALTER TENANT `). For - `tenant-rw`, this value will override any setting from the tenant's side, + - `ALTER VIRTUAL CLUSTER ALL SET CLUSTER SETTING = ` + - Sets the value seen by all non-system VCs, except those that have a + specific value for that VC (set with `ALTER VIRTUAL CLUSTER `). For + `application`, this value will override any setting from the VC's side, until the cluster setting is reset. - Note that this statement does not affect the system tenant settings. - - - `ALTER TENANT RESET CLUSTER SETTING ` - - Resets the tenant setting. For `tenant-ro`, the value reverts to the `ALL` - value (if it is set), otherwise to the setting default. For `tenant-rw`, - the value reverts to the `ALL` value (if it is set), otherwise to whatever - value was set by the tenant (if it is set), otherwise the setting default. - - - `ALTER TENANT ALL RESET CLUSTER SETTING ` - - Resets the all-tenants setting. For tenants that have a specific value set - for that tenant (using `ALTER TENANT `), there is no change. For other - tenants, `tenant-ro` values revert to the setting default and `tenant-rw` - values revert to whatever value was set by the tenant (if it is set), - otherwise the setting default. - - - `SHOW CLUSTER SETTING FOR TENANT ` + Note that this statement does not affect the system tenant's settings. + + - `ALTER VIRTUAL CLUSTER RESET CLUSTER SETTING ` + - Resets the VC setting override. For `system-visible`, the value reverts to + the shared value in `system.settings` (if it is set), otherwise + to the setting default. For `application`, the value reverts to the + `ALL` value (if it is set), otherwise to whatever value was set + by the VC (if it is set), otherwise the build-time default. + + - `ALTER VIRTUAL CLUSTER ALL RESET CLUSTER SETTING ` + - Resets the all-VCs setting override. For VCs that have a specific + value set for that VC (using `ALTER VIRTUAL CLUSTER `), there is + no change. For other VCs, `system-visible` values revert to the + value set in system tenant's `system.settings`, or build-time default if + there's no customization; and `application` values revert to + whatever value was set by the VC (if it is set), otherwise + the build-time default. + + - `SHOW CLUSTER SETTING FOR VIRTUAL CLUSTER ` - Display the setting override. If there is no override, the statement returns NULL. (We choose to not make this statement 'peek' into - the tenant to display the customization set by the tenant itself.) + the VC to display the customization set by the VC itself.) - - `SHOW [ALL] CLUSTER SETTINGS FOR TENANT ` - - Display the setting overrides for the given tenant. If there is no + - `SHOW [ALL] CLUSTER SETTINGS FOR VIRTUAL CLUSTER ` + - Display the setting overrides for the given VC. If there is no override, the statement returns NULL. In all statements above, using `id=1` (the system tenant's ID) is not valid. -New semantics for existing statements for tenants: - - `SHOW [ALL] CLUSTER SETTINGS` shows the `tenant-ro` and `tenant-rw` settings. - `Tenant-ro` settings that have an override from the host side are marked as +New semantics for existing statements for VCs: + - `SHOW [ALL] CLUSTER SETTINGS` shows the `system-visible` and `application` settings. + `system-visible` settings that have an override from the KV side are marked as such in the description. - - `SET/RESET CLUSTER SETTING` can only be used with `tenant-rw` settings. For - settings that have overrides from the host side, the statement will fail - explaining that the setting can only be changed once the host side resets the + - `SET/RESET CLUSTER SETTING` can only be used with `application` settings. For + settings that have overrides from the KV side, the statement will fail + explaining that the setting can only be changed once the KV side resets the override. ## Implementation @@ -150,11 +185,12 @@ New semantics for existing statements for tenants: The proposed implementation is as follows: - We update the semantics of the existing `system.settings` table: - - on the system tenant, this table continues to store values for all - settings (for the system tenant only) - - on other tenants, this table stores only values for `tenant-rw` settings. + - on the system tenant, this table continues to store values for + all settings (for the system tenant only, and secondary VCs + for `system-visible` settings) + - on other VCs, this table stores only values for `application` settings. Any table rows for other types of variables are ignored (in the case that - the tenant manually inserts data into the table). + the VC manually inserts data into the table). - We add a new `system.tenant_settings` table with following schema: ``` @@ -168,9 +204,9 @@ The proposed implementation is as follows: PRIMARY KEY (tenant_id, name) ) ``` - This table is only used on the system tenant. All-non-system-tenant values + This table is only used on the system tenant. All-VC override values are stored in `tenant_id=0`. This table contains no settings for the system - tenant (`tenant_id=1`), and the `tenant_id=0` values do not apply to the + VC (`tenant_id=1`), and the `tenant_id=0` values do not apply to the system tenant. - We modify the tenant connector APIs to allow "listening" for updates to @@ -178,17 +214,17 @@ The proposed implementation is as follows: streaming RPC (similar to `GossipSubscription`). - On the system tenant we set up rangefeed on `system.tenant_settings` and keep - all the changed settings (for all tenants) in memory. We expect that in - practice overrides for specific tenants are rare (with most being "all - tenant" overrides). The rangefeed is used to implement the API used by the - tenant connector to keep tenants up to date. We continue to set up the + all the changed settings (for all VCs) in memory. We expect that in + practice overrides for specific VCs are rare (with most being "all + VC" overrides). The rangefeed is used to implement the API used by the + tenant connector to keep VCs up to date. We continue to set up the rangefeed on the `system.settings` table to maintain the system tenant settings. - - On non-system tenants we continue to set up the rangefeed on the tenant's - `system.settings` table, and we also use the new connector API to listen to - updates from the host. Values from the host which are present always override - any local values. + - On non-system VCs we continue to set up the rangefeed on the + VC's `system.settings` table, and we also use the new connector + API to listen to updates from the storage cluster. Values from the + storage cluster which are present always override any local values. ### Upgrade @@ -197,11 +233,11 @@ The proposed solution has very few concerns around upgrade. There will be a migration to create the new system table, and the new connector API implementation is only active on the new version (in a mixed-version cluster, it can error out or use a stub no-op implementation). The new statements (around -setting per-tenant values) should also error out until the new version is +setting per-VC values) should also error out until the new version is active. -The settings on the system tenant will continue to work. On non-system tenants, -any locally changed settings that are now `system` or `tenant-rw` will revert to +The settings on the system tenant will continue to work. On non-system VCs, +any locally changed settings that are now `system` or `application` will revert to their defaults. It will be necessary to set these settings from the system tenant (using the new statements) if any clusters rely on non-default values. @@ -216,22 +252,23 @@ When deciding which class is appropriate for a given setting, we will use the following guidelines: - if the setting controls a user-visible aspect of SQL, it should be a - `tenant-rw` setting. + `application` setting. - - control settings relevant to tenant-specific internal implementation (like - tenant throttling) that we want to be able to control per-tenant should be - `tenant-ro`. + - control settings relevant to VC-specific internal implementation (like + VC throttling) that we want to be able to control per-VC should be + `system-visible`, or possibly `application` with an override, depending on whether + we want different overrides for different VCs. - - when in doubt the first choice to consider should be `tenant-rw`. + - when in doubt the first choice to consider should be `application`. - `system` should be used with caution - we have to be sure that there is no - internal code running on the tenant that needs to consult them. + internal code running on the VC that needs to consult them. -We fully hide `system` settings from non-system tenants. The cluster settings -subsystem will not allow accessing these values from a tenant process (it will -crash the tenant process, at least in testing builds, to find any internal code +We fully hide `system` settings from non-system VCs. The cluster settings +subsystem will not allow accessing these values from a VC process (it will +crash the VC process, at least in testing builds, to find any internal code that incorrectly relies on them). The values of these settings are unknown to -the tenant APIs for changing tenant settings (i.e. if a tenant attempts to read +the VC APIs for changing VC settings (i.e. if a VC attempts to read or set such a setting, it will get the "unknown cluster setting" error). @@ -243,7 +280,7 @@ There are three possibilities in terms of the system table changes: - Pro: clean schema, easier to reason about. - Pro: no schema changes on the existing system table. - - b) Use the existing `system.settings` table as is. For tenant-specific + - b) Use the existing `system.settings` table as is. For VC-specific settings and overrides, encode the tenant ID in the setting name (which is the table PK), for example: `tenant-10/sql.notices.enabled`. - Pro: no migrations (schema changes) for the existing system table. @@ -262,18 +299,18 @@ There are three possibilities in terms of the system table changes: (with the added concern that we have code that parses the raw KVs for this table directly). -A previous proposal was to store `tenant-ro` values in each tenant's +A previous proposal was to store `system-visible` values in each VC's `system.settings` table and disallowing arbitrary writes to that table. While this would be less work in the short-term, it will give us ongoing headaches -because it breaks the tenant keyspace abstraction. For example, restoring a +because it breaks the VC keyspace abstraction. For example, restoring a backup will be problematic. -Another proposal was to store all tenant settings on the host side and allow the -tenant to update them via the tenant connector. This is problematic for a number +Another proposal was to store all VC settings on the storage side and allow the +VC to update them via the tenant connector. This is problematic for a number of reasons, including transactionality of setting changes and opportunities for abuse. A previous version of the proposal included a "system visible" (or "shared -read-only") class, for system settings that the tenants can read. However, given -the support for all-tenant values for `tenant-ro`, the functional differences +read-only") class, for system settings that the VCs can read. However, given +the support for all-VC values for `system-visible`, the functional differences between these two classes becomes very small.