
sql: add query denylist facility #51643

Open
jordanlewis opened this issue Jul 21, 2020 · 21 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@jordanlewis
Member

jordanlewis commented Jul 21, 2020

It can be desirable to deny a query fingerprint from being run on the database, because of bugs in CockroachDB, difficulty in tracking down expensive queries in a workload, or other reasons. There's currently no way to do this besides turning off the source of the query.

We should add a facility to deny a particular query from being run at all. I'm imagining a system table that has a query fingerprint as a key. We'd need some way to cache this information, since it would not be okay to check the denylist on receipt of every query.

Conceivably, a similar mapping could be used for query plan management, so whatever representation we decide on, we should make sure that the "settings" for a query fingerprint can be extended in new ways.

Another important feature, mentioned by @joshimhoff, is the inclusion of a "dry run" flag. Blocking a query fingerprint is very disruptive and dangerous. It's important to be able to test the denial of a fingerprint before activating it. Turning on the "dry run" flag for a query fingerprint should cause the database to log when it receives a query that would be blocked by that rule.

Probably, a feature like this should also get a time-series metric, so operators can detect that it is firing without having to dig through logs.
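The pieces described above (fingerprint-keyed rules, a cached lookup, a dry-run flag, a denial counter) fit together roughly as in this illustrative sketch. All names here (DenylistEntry, should_block, etc.) are hypothetical, not CockroachDB code:

```python
# Illustrative sketch only: a cached denylist keyed by query fingerprint,
# with a per-entry dry-run flag and a counter that would back a metric.
from dataclasses import dataclass

@dataclass
class DenylistEntry:
    fingerprint: str  # e.g. "SELECT _, _ FROM _ WHERE _ = _"
    dry_run: bool     # log matches instead of blocking them

class Denylist:
    def __init__(self, entries):
        # Cached in memory, so the hot path never reads the system table.
        self.entries = {e.fingerprint: e for e in entries}
        self.denied_count = 0  # would back a time-series metric

    def should_block(self, fingerprint: str) -> bool:
        entry = self.entries.get(fingerprint)
        if entry is None:
            return False
        if entry.dry_run:
            # Dry run: record what *would* have been denied, but let it run.
            print(f"dry run: would deny {fingerprint!r}")
            return False
        self.denied_count += 1
        return True
```

A usage sketch: a rule with `dry_run=True` only logs, while a rule with `dry_run=False` blocks and bumps the counter.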


Jira issue: CRDB-4022

@jordanlewis jordanlewis added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 21, 2020
@jordanlewis
Member Author

@solongordon, I've tentatively assigned this to the features team, but it's certainly up for debate.

@joshimhoff
Collaborator

Thanks for opening this!

Probably, a feature like this should also get a time-series metric, so operators can detect that it is firing without having to dig through logs.

Ya. Could also have an admin UI page (not mutually exclusive with the time-series metric).

@otan
Contributor

otan commented Jul 21, 2020

Calling out adapting something like https://github.com/dropbox/load_management as a solution - this library does rate limiting for query patterns in general, as opposed to a straight-up denylist, if that's something we ever decide to foray into...

@joshimhoff
Collaborator

SRE really needs this!

I found a duplicate issue from January: #44597. That was an action item from this incident: https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/273449722/2020-01-22+import+job+of+death+crash-loops+cluster. Should have poked it more!

Two Fridays ago, we had two production issues that could have been mitigated much more quickly and with much less effort (from SRE & DB eng) if a denylist had been available:

  1. Customer uses an experimental feature (tmp tables? related to tmp tables...) in such a way that causes all system tables to become unavailable: sql: fix connection shutdown with temp tables #52148
  2. OOMs caused by an inverted index creation job (probably caused by not having memory accounting on the backfiller component).

In both cases, the customer does something which triggers the bad behavior. With 1, SRE could make system tables available again by rolling the nodes, but if the customer uses the problematic feature again, same issue.

This isn't sustainable. There are a lot of cloud clusters already. If the cause of some bad behavior is a known stream of queries, we need to be able to block that stream from running, without talking to a customer. Note that between 1 and 2, SRE got paged 5-6 different times over a three day period, including once at night, even though both issues were understood.

In general, SRE needs mitigations that don't involve code changes!

Relatedly, this makes the coding fix lower priority to release, which is good for DB eng and release eng (release eng made a cherry-pick release for SRE on Friday).

@jordanlewis
Member Author

While I agree we should really have this feature to protect SRE time and customer clusters, I'm not so sure I agree that your two examples are the best ones. In my mind, the use of the denylist is to help customers whose workloads are pushing the cluster over the edge against their wishes, but who can't isolate which queries are doing it or stop just part of the workload without bringing everything down.

In this case, on the other hand, a user is trying and failing to run something over and over. Let's say we used the denylist to deny these queries from being run. What error would the user get after we added the denylist? Would the user actually be happier in this case?

@joshimhoff
Collaborator

joshimhoff commented Aug 10, 2020

In this case, on the other hand, a user is trying and failing to run something over and over. Let's say we used the denylist to deny these queries from being run. What error would the user get after we added the denylist? Would the user actually be happier in this case?

Here are the two examples:

  1. Customer uses an experimental feature (tmp tables? related to tmp tables...) in such a way that causes all system tables to become unavailable: sql: fix connection shutdown with temp tables #52148
  2. OOMs caused by an inverted index creation job (probably caused by not having memory accounting on the backfiller component).

Re: 2, for a long while, the index creation job wasn't succeeding. It is strictly better for the user to have a non-OOMing cluster & a failing index creation job than an OOMing cluster & a failing index creation job. Also, creating an inverted index is not generally in the critical path of a customer's application. Some queries are more important to serve than others. SRE can't know exactly what is best for a particular customer, but there are rules of thumb: it's better to fail a very small number of queries (not kinds of queries but absolute # of queries) than most / all queries. It's better to fail DDL (certain statements especially, e.g. CC's SLA on backup success isn't tight yet) than DML.

Re: 1, my understanding is that the feature was experimental. I def think we should turn off an experimental feature to protect the overall health of a cluster.

It is certainly hard to generalize. But blacklisting queries is an option that the SRE oncall needs. Even more so once we run multi-tenant CRDB, since then there are other users to protect.

In this case, on the other hand, a user is trying and failing to run something over and over.

Ya...
The user isn't necessarily aware of the effects of their actions on the reliability of their own CC cluster & app though...

Part of the deal with a managed service is you give up some control to the group running the service.

What error would the user get after we added the denylist?

Something that tells the user that the queries have been blacklisted to protect overall system health & to contact support if needed.

@otan
Contributor

otan commented Sep 9, 2020

cc @awoods187 @vy-ton

Looking at getting Angela started on this over the upcoming few weeks. I am proposing we scope this issue as pure "deny-list" functionality, as opposed to a "rate limit" solution (which requires substantially more work and many more configuration settings and knobs).

I've started us off with strawman implementation ideas. When this has settled, we'll turn it into an RFC. Let me know if you prefer a Google Drive doc!

EDIT: use this doc for collaboration


Strawman Denylist Parameters

Option 0: Exact Query Match denial

As it sounds: take the exact query text the user provides as the denylist entry. This does not work well when queries differ only slightly, e.g. in the WHERE clause. For that reason, it's probably a non-starter.

Option 1: Query Fingerprint

We can allow query fingerprints, similar to the ones shown on the list statements page. They look like this:

SELECT _, _, _ FROM _ WHERE _ = _ ORDER BY _

If we have a matching fingerprint, we deny the request. This has the side effect that it may block legit traffic which would otherwise be allowed, since we anonymise some of these fields.
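To illustrate the over-matching point, here is a rough, hypothetical sketch of literal-anonymising fingerprinting (illustrative Python, not CRDB's actual implementation): queries that differ only in their constants collapse to one fingerprint, so a single rule catches all of them, including the legit ones.

```python
# Hypothetical sketch: replace string and numeric literals with "_",
# so queries differing only in constants share one fingerprint.
import re

def fingerprint(query: str) -> str:
    q = re.sub(r"'[^']*'", "_", query)      # string literals -> _
    q = re.sub(r"\b\d+(\.\d+)?\b", "_", q)  # numeric literals -> _
    return re.sub(r"\s+", " ", q).strip()   # normalize whitespace
```

For example, `fingerprint("SELECT a FROM t WHERE b = 1")` and `fingerprint("SELECT a FROM t WHERE b = 42")` produce the same string, which is exactly why a fingerprint rule can deny more traffic than intended.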

Option 2: Regex

Instead, allow specification of a regexp to act as the denylist entry, e.g.

SELECT a, b, c FROM d WHERE e = .* ORDER BY f

This can get annoying as some elements may have to be escaped, e.g. SELECT ((a)) FROM b would have to be written as SELECT \(\(a\)\) FROM b. However, overall it gives more power to the user.
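The escaping chore can be handled mechanically on the client side; a sketch in Python (illustrative only, the real matching would live server-side):

```python
# Sketch of the escaping concern: turning a literal query into a safe
# regex requires escaping metacharacters, which re.escape automates.
import re

literal = "SELECT ((a)) FROM b"
pattern = re.escape(literal)  # parens etc. are backslash-escaped
assert re.fullmatch(pattern, literal)

# A hand-written pattern can then relax parts of it, e.g. the WHERE value:
deny = re.compile(r"SELECT a, b, c FROM d WHERE e = .* ORDER BY f")
assert deny.fullmatch("SELECT a, b, c FROM d WHERE e = 10 ORDER BY f")
```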

Option 3: All of the above

Offer all of the above. Makes the configuration much more complicated, but covers all cases.

Strawman Denylist Management

Here are a couple of strawman ideas for managing the denylist:

Option A: File managed blacklist

Store a denylist locally on every machine as a config file, e.g. query_denylist.yaml. A DB admin would have to scp this file onto every machine whenever it needs an update, and CockroachDB would watch for changes to this file. Easy solution to implement, but hard for the customer to use as it requires machine access.

If a user wants to "dry-run" these changes, they could upload a query_denylist.staging.yaml file to all machines, which would log matching queries without actually blocking them.

File Format?

YAML? JSON? Custom format? Which one is best?
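Purely as a strawman, a YAML layout could look like the following; every field name below is invented here, not a settled format:

```yaml
# Hypothetical query_denylist.yaml layout (field names are invented):
rules:
  - type: fingerprint
    pattern: "SELECT _, _, _ FROM _ WHERE _ = _ ORDER BY _"
    dry_run: true
    message: "blocked by the database administrator; contact support"
  - type: regex
    pattern: 'SELECT a, b, c FROM d WHERE e = .* ORDER BY f'
    dry_run: false
```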

Option B: CLUSTER SETTING

Use a cluster setting, e.g. SET CLUSTER SETTING sql_denylist = 'queryA, queryB, queryC', to be used as the denylist. This propagates the way cluster settings currently do and is easy enough to implement.

Unfortunately, a single string used as a denylist can be quite unmanageable. It most likely requires some sort of file format. However, we could do something like SET CLUSTER SETTING sql_denylist IMPORT FROM <stdin/nodelocal/etc>, which reads in the file as appropriate and sends it off as a string. This would likely also be useful for other things in future.

Dry runs can be handled by a similar cluster setting, e.g. SET CLUSTER SETTING sql_staging_denylist.

Bit more work, still has the file management issue.

Option C: System table

Use a system table to store denylist queries. Create a table like so:

CREATE TABLE system.denylist (
   id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
   param string,
   type enum('regex', 'fingerprint', 'raw'),
   dryrun bool
)

Each node would cache this denylist, refreshing it from the table periodically (e.g. every 30 seconds). Alternatively, an update is sent via gossip that forces nodes to refresh when the table changes (the latter sounds like something that should have existed already). Refreshing on every update can cause extra contention, so a gossip-sync mechanism on top of a longer polling interval sounds like a better idea.

We can create builtins which insert into this table, to make it easy to manage, e.g. SELECT insert_into_denylist('fingerprint', 'SELECT _ FROM _', false) (as opposed to INSERT INTO system.denylist VALUES (...)).

I'm not experienced enough with gossip / notify protocols to know whether the update protocol already exists, but if it does, this sounds like a good option.
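The per-node cache from Option C can be sketched like so (hypothetical names, illustrative Python rather than CRDB's Go): the hot query path only reads the in-memory copy, and a fetch against system.denylist happens at most once per interval.

```python
# Sketch of a periodically refreshed cache (names are invented).
import time

class CachedDenylist:
    def __init__(self, fetch, refresh_interval=30.0, clock=time.monotonic):
        self._fetch = fetch              # would read system.denylist
        self._interval = refresh_interval
        self._clock = clock              # injectable for testing
        self._entries = fetch()
        self._last = clock()

    def entries(self):
        # Serve from cache; re-fetch only when the interval has elapsed.
        now = self._clock()
        if now - self._last >= self._interval:
            self._entries = self._fetch()
            self._last = now
        return self._entries
```

A gossip notification would replace (or shorten) the interval-based path by invalidating the cache eagerly.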

Option D: env variable/cli flag

Similar to A, except a user would need to set an env var / CLI flag. Managing this would require restarting every node.

Metrics

I assume we'll want all these metrics:

  • Time series graph on # queries denied
  • Verbosity-specified level of logging in logs and/or logging to a different file when queries get denied.

Messaging

The output message to the user should be something useful, e.g.

pgerrors.Newf(pgcode.ConfigurationLimitExceeded, "this query was added to a denylist by the database administrator")

@joshimhoff
Collaborator

THIS IS SO EXCITING!!!!

@joshimhoff
Collaborator

joshimhoff commented Sep 9, 2020

Some thoughts:

  1. I think it makes sense to build option 1 or 2? Option 0 doesn't go far enough, does it?

  2. I think a dry run mode is a requirement. It's just too dangerous setting the deny list without a dry run mode. Agreed?

  3. I think another really important requirement is to be able to set this in emergencies, when a cluster may not be available.

More on 3. Here are the three options:

a. File.
b. Cluster setting.
c. System table.

I dislike b & c since it may be hard to make a change via those mechanisms if the production incident is very severe.

a sounds good at first, but in the k8s environment (in the CC environment), it's hard to get access to the persistent disk if nodes are crashing in certain ways, e.g. if the CRDB nodes are OOMing. In that case, the containers don't stay up for long. There are ways around this, but they take time and they are painful.

The simplest interface for operators in the k8s environment is a CLI flag or env variable, IMHO. These can be set regardless of what is happening with the cluster, and k8s has good support for both.

@otan
Contributor

otan commented Sep 9, 2020

I think it makes sense to build option 1 or 2? Option 0 doesn't go far enough, does it?

Yeah. But it's the ultimate strawman :D

I think a dry run mode is a requirement. It's just too dangerous setting the deny list without a dry run mode. Agreed?

Agreed!

I think another really important requirement is to be able to set this in emergencies, when a cluster may not be available.

Heard.

it's hard to get access to the persistent disk if nodes is crashing in certain ways, e.g. if the CRDB nodes are OOMing

OOM would not stop you placing a file onto disk, would it?

The simplest interface for operators in the k8s environment is a CLI flag or env variable, IMHO. These can be set regardless of what is happening with the cluster, and k8s has good support for both.

Would that require a restart for every node in the cluster? Also, wouldn't managing a larger denylist be pretty difficult using env variables?

@petermattis
Collaborator

My personal preference would be for this configuration to be done via a cluster setting, similar to the hba.conf setting. See server.host_based_authentication.configuration. It is true that we'd want to have safeguards in place against an admin shooting themselves in the foot with the denylist. Perhaps the config could have two modes for each regex/pattern: deny and log. That would allow the admin to test the pattern with log before enabling it with deny.

We probably also want to prevent denylisting certain queries, or queries from certain users. For example, denying SET CLUSTER SETTING .* could be problematic as then you'd never be able to adjust the denylist again. This wouldn't be a problem if the denylist did not apply to users who are able to set cluster settings.

An argument against using a cluster setting is that a query of death may be prohibiting the cluster from successfully executing such a query.
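The self-lockout risk could be checked at the moment a rule is added; a hypothetical sketch (validate_rule and PROTECTED_STATEMENTS are invented names, and sql_denylist is the strawman setting name from above):

```python
# Hypothetical guard: refuse denylist patterns that would match the
# statements an operator needs in order to undo a bad rule.
import re

PROTECTED_STATEMENTS = [
    "SET CLUSTER SETTING sql_denylist = ''",
]

def validate_rule(pattern: str) -> None:
    rx = re.compile(pattern)
    for stmt in PROTECTED_STATEMENTS:
        if rx.search(stmt):
            raise ValueError(f"pattern {pattern!r} would deny {stmt!r}")
```

With this check, `validate_rule("SET CLUSTER SETTING .*")` is rejected, while an ordinary pattern like `"SELECT .* FROM d"` is accepted.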

@joshimhoff
Collaborator

joshimhoff commented Sep 10, 2020

We probably also want to prevent denylisting certain queries, or queries from certain users. For example, denying SET CLUSTER SETTING .* could be problematic as then you'd never be able to adjust the denylist again. This wouldn't be a problem if the denylist did not apply to users who are able to set cluster settings.

Good point!

An argument against using a cluster setting is that a query of death may be prohibiting the cluster from successfully executing such a query.

Ya this is my issue with using a cluster setting. I think if we use a cluster setting there needs to be another path to setting it that is available even if the cluster is not. Really, ideally, this applies to all cluster settings, when I think about it.

OOM would not stop you placing a file onto disk, would it?

On k8s it is hard to write to the disk if a cluster is OOMing. If OOMing, the containers don't stay up long enough for an operator to exec in (https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/) and write the file. There are ways around this (e.g. stop the CRDB pod from executing & mount the disk via another pod) but they are fairly high toil in the k8s environment. Setting a CLI flag is easy OTOH with k8s.

Would that require a restart for every node in the cluster?

Yes but I don't think this is a problem really. CRDB can handle rolling restarts. What do you dislike about this?

Also, wouldn't managing a larger denylist be pretty difficult using env variables?

Ya a bit tedious. A file is nice if there is some way for an operator to easily set it in the k8s environment even when nodes are crashing. Maybe there is a way. Will need to think more with my SRE friends. Possibly a k8s config map that is mounted somewhere on the filesystem (that is, not the store dir). So to y'all, just a file, but not in the store directory.

https://kubernetes.io/docs/concepts/configuration/configmap/

@chrisseto
Contributor

chrisseto commented Sep 10, 2020

Swooping in with some scope creep, would it be possible to include a message with denied queries?

Hypothetically, if the host of the CRDB instance doesn't have a direct line of communication to the end user, it may be helpful to provide some context or further instructions when a query is rejected.

IE: "Hey this is a bad query try ___ instead" Or "This triggers a known bug. See [github link] for workarounds and details."

EDIT: swapped <> with []

@otan
Contributor

otan commented Sep 10, 2020

IE: "Hey this is a bad query try ___ instead" Or "This triggers a known bug. See [github link] for workarounds and details."

Yep, that (should be) easy enough. I've added it to the comment.

@otan
Contributor

otan commented Sep 10, 2020

Ya a bit tedious. A file is nice if there is some way for an operator to easily set it in the k8s environment even when nodes are crashing. Maybe there is a way. Will need to think more with my SRE friends. Possibly a k8s config map that is mounted somewhere on the filesystem (that is, not the store dir). So to y'all, just a file, but not in the store directory.

Let me know what happens here. It sounds like, for foolproofness, the file / env var approach is the minimum requirement.

I think it's possible we may want files and cluster settings, for ease of use, but file seems to be the most foolproof way. Would be nice to get input from other users as well.

@JuanLeon1
Contributor

I like the config map because it sticks out in GitHub PRs like a sore thumb after you have made a cluster into a pet.
Also, I think k8s can implement the config map as a file, which, from the DB side, reduces to the file case, and does not require a node restart.
Eventually, config maps could also be read by an "admin" service in the pod that could do some of the other things we need to do remotely (network-layer throttling by principal, e.g.), but that is ill-defined and further in the future.

@DuskEagle
Member

I think reading from a file, the path of which can be specified via a flag to cockroach start, provides the best operator experience. E.g. cockroach start --sql-denylist=/path/to/denylist. Like Juan said, this also works well in a K8s environment, where you can manage the denylist using a ConfigMap which is projected to a file within the CRDB pod. If CRDB can watch for changes to this file and reload the denylist, then no restart is necessary.

I think it's possible we may want files and cluster settings, for ease of use, but file seems to be the most foolproof way. Would be nice to get input from other users as well.

I can see this as well. Using files is useful for operators that have filesystem-level access to the cluster, and is also necessary when attempting to configure a denylist when the cluster is unavailable. For users with only SQL-level access, a cluster setting would be useful. But like @petermattis mentioned, we'd have to be careful about locking users out of their cluster by denylisting SET CLUSTER SETTING queries.

I'm not convinced that the denylist should not apply to users who are able to set cluster settings - many potentially destabilizing queries, such as large bulk-IO operations, are already restricted to admin users. As an operator, if you're trying to denylist a particular query pattern, you'd have to check whether the query originates from an admin user to know if the denylist entry would have any effect.

@petermattis
Collaborator

I think reading from a file, the path of which can be specified via a flag to cockroach start, provides the best operator experience. E.g. cockroach start --sql-denylist=/path/to/denylist.

Will --sql-denylist apply only to the current node? Is there any concern about different nodes having different denylists?

If CRDB can watch for changes to this file and reload the denylist, then no restart is necessary.

Watching for changes to a file isn't something CRDB typically does (I'm not sure if there are existing examples of such behavior). We could add a facility to reload the file via a SQL builtin or similar.

@otan
Contributor

otan commented Sep 14, 2020

Moving this to a doc because there are a lot of threads to pull on here.

@angelapwen

angelapwen commented Oct 20, 2020

PR for the RFC is up at #55778, borrowing heavily from this discussion and the brainstorming doc!

craig bot pushed a commit that referenced this issue Nov 12, 2020
55325: ui: Transactions link between Sessions and Statements r=dhartunian a=elkmaster

Resolves: #55131, #56244

Release note (admin ui change): Link to the Transactions page is now shown
between the Sessions and Statements links in the left hand navigation. This more
clearly reflects the hierarchy between the 3 concepts.

56346: testcluster: minor logging improvements r=andreimatei a=andreimatei

Log when TestCluster quiescing starts, and add a node log tag to each
node's quiescing ctx so messages from different nodes can be
disambiguated.

Release note: None

56437: cli, ui: dismiss release notes signup banner per environment variable r=knz,dhartunian a=nkodali

Previously, the signup banner could only be dismissed manually.
For internal testing purposes, this banner is unnecessary. This
change provides a way to dismiss the signup banner upon start of
a cluster via the cli by setting the environment variable
COCKROACH_UI_RELEASE_NOTES_SIGNUP_DISMISSED=true.

Resolves #46998

Release note: none

56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen

Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands.

See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. 

Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'`

— Commit message below — 

Adds a cluster setting to toggle a feature flag for the BACKUP and
RESTORE commands off and on; as well as a broad category for
Bulk IO commands. Currently disabling the cluster setting for Bulk
IO will only disable BACKUP and RESTORE jobs, but other types may
be included in this category in the future.

The feature is being introduced to address a Cockroach Cloud SRE
use case: needing to disable certain categories of features in
case of cluster failure.

Release note (enterprise change): Adds cluster settings to enable/
disable the BACKUP and RESTORE commands. If a user attempts to use
these features while they are disabled, an error indicating that
the database administrator has disabled the feature is surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.bulkio.backup.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.backup.enabled = TRUE;
SET CLUSTER SETTING feature.bulkio.restore.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.restore.enabled = TRUE;
SET CLUSTER SETTING feature.bulkio.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.enabled = TRUE;

56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco

Previously, when using OSS builds (created with `make buildoss`), when
you loaded the DB Console in your browser, you'd get "Page Not Found".
The route for the overview page was missing the leading '/'.  This bug
appears to have been introduced in
722c932.

Release note (admin ui change): This fixes a bug where users of
the OSS builds of CockroachDB would see "Page Not Found" when loading
the Console.

56600: roachpb: remove SetInner in favor of MustSetInner r=nvanbenschoten a=tbg

As of a recent commit, `ErrorDetail.SetInner` became unused, and
we can switch to a `MustSetInner` pattern for `ErrorDetail`. Since
the codegen involved is shared with {Request,Response}Union, those
lose the `SetInner` setter as well; we were always asserting on
the returned bool there anyway so this isn't changing anything.

Release note: None



Co-authored-by: Vlad Los <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Namrata Kodali <[email protected]>
Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Joshua M. Clulow <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
craig bot pushed a commit that referenced this issue Nov 13, 2020
56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen

Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands.

See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. 

Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'`

— Commit message below — 

Adds a cluster setting to toggle a feature flag for the BACKUP and
RESTORE commands off and on. The feature is being introduced to
address a Cockroach Cloud SRE use case: needing to disable certain
categories of features in case of cluster failure.

Release note (enterprise change): Adds cluster settings to enable/
disable the BACKUP and RESTORE commands. If a user attempts to use
these features while they are disabled, an error indicating that
the database administrator has disabled the feature is surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.backup.enabled = FALSE;
SET CLUSTER SETTING feature.backup.enabled = TRUE;
SET CLUSTER SETTING feature.restore.enabled = FALSE;
SET CLUSTER SETTING feature.restore.enabled = TRUE;

56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco

Previously, when using OSS builds (created with `make buildoss`), when
you loaded the DB Console in your browser, you'd get "Page Not Found".
The route for the overview page was missing the leading '/'.  This bug
appears to have been introduced in
722c932.

Release note (admin ui change): This fixes a bug where users of
the OSS builds of CockroachDB would see "Page Not Found" when loading
the Console.

56631: kv: remove assertion around QuotaPool and Raft window size config r=nvanbenschoten a=nvanbenschoten

This check was well intentioned when added, but seems misguided
in retrospect. There are valid reasons why someone might want to
configure limits to various dimensions of the Raft window size
in ways that, when combined, exceeds the total window size limit
(i.e. the quota pool size).

Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Joshua M. Clulow <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Nov 19, 2020
56872: changefeedccl, importccl, sql: IMPORT/EXPORT/changefeed feature flags r=otan a=angelapwen

(Duplicates #56833 which was not updating my most recent commits)

Adds a cluster setting to toggle a feature flag for the IMPORT,
EXPORT, and changefeed commands off and on. The feature is being
introduced to address a Cockroach Cloud SRE use case: needing to
disable certain categories of features in case of cluster failure.
See #51643

Release note (enterprise change): Adds cluster settings to enable/
disable the IMPORT, EXPORT, and changefeed commands. If a user
attempts to use these features while they are disabled, an error
indicating that the database administrator has disabled the
feature is surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.import.enabled = FALSE;
SET CLUSTER SETTING feature.import.enabled = TRUE;
SET CLUSTER SETTING feature.export.enabled = FALSE;
SET CLUSTER SETTING feature.export.enabled = TRUE;
SET CLUSTER SETTING feature.changefeed.enabled = FALSE;
SET CLUSTER SETTING feature.changefeed.enabled = TRUE;

Co-authored-by: angelapwen <[email protected]>
craig bot pushed a commit that referenced this issue Nov 25, 2020
57040: sql, featureflag: add schema change feature flag r=otan a=angelapwen

The following features have been tested and act as expected:

- [x] ALTER DATABASE OWNER
- [x] ALTER DATABASE ADD REGION
- [x] ALTER DATABASE DROP REGION
- [x] ALTER DATABASE SURVIVE
- [x] ALTER TABLE REGIONAL AFFINITY
- [x] ALTER TABLE SET SCHEMA
- [x] ALTER SCHEMA
- [x] ALTER TYPE
- [x] ALTER SEQUENCE RENAME
- [x] ALTER SEQUENCE SET SCHEMA 
- [x] CREATE DATABASE
- [x] CREATE INDEX
- [x] CREATE SCHEMA
- [x] CREATE TABLE
- [x] CREATE TYPE
- [x] CREATE VIEW
- [x] CREATE SEQUENCE
- [x] DROP DATABASE
- [x] DROP INDEX
- [x] DROP TABLE
- [x] DROP VIEW
- [x] DROP SEQUENCE
- [x] DROP TYPE
- [x] DROP SCHEMA
- [x] REASSIGN OWNED BY
- [x] DROP OWNED BY
- [x] RENAME DATABASE
- [x] REPARENT DATABASE
- [x] RENAME INDEX
- [x] RENAME TABLE
- [x] RENAME COLUMN 
- [x] COMMENT ON COLUMN 
- [x] COMMENT ON DATABASE
- [x] COMMENT ON INDEX
- [x] COMMENT ON TABLE
- [x] ALTER INDEX CONFIGURE ZONE
- [x] ALTER TABLE CONFIGURE ZONE

See #51643 for background.

ANALYZE/CREATE STATISTICS will not be considered a schema change and were added in a separate PR, #57076 
The SPLIT/UNSPLIT features will also be resolved in a separate PR, as the execution path is different from the schema changes addressed in this PR. 

---Commit message---

This change adds a feature flag via cluster setting for all
designated features that perform schema changes or DDLs. The
feature is being introduced to address a Cockroach Cloud SRE use
case: needing to disable certain categories of features, such as
schema changes, in case of cluster failure.

Release note (sql change): Adds a feature flag via cluster
setting for all schema change-related features. If a user attempts
to use these features while they are disabled, an error indicating
that the database administrator has disabled the feature is
surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.schemachange.enabled = FALSE;
SET CLUSTER SETTING feature.schemachange.enabled = TRUE;

Co-authored-by: angelapwen <[email protected]>
@rafiss
Copy link
Collaborator

rafiss commented Mar 15, 2021

Leaving this in the long-term backlog since the project focus was changed to making feature-specific cluster settings. If we need a real denylist later, we can revisit.

@rafiss rafiss added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label May 12, 2021