
rfc: create SQL query denylist RFC #55778

Open · wants to merge 1 commit into base: master

Conversation

angelapwen

Co-Authored-By: Oliver Tan [email protected]

View text at: https://github.com/angelapwen/cockroach/blob/denylist-rfc/docs/RFCS/20201020_sql_query_denylist.md

Thank you in advance for the reviews! ✨

@angelapwen angelapwen requested review from joshimhoff and otan October 20, 2020 20:30
@angelapwen angelapwen requested a review from a team as a code owner October 20, 2020 20:30
@angelapwen angelapwen self-assigned this Oct 20, 2020
@cockroach-teamcity
Member

This change is Reviewable

@otan otan requested a review from vy-ton October 20, 2020 21:43
@angelapwen angelapwen force-pushed the denylist-rfc branch 3 times, most recently from f4d7de1 to 32749dc on October 20, 2020 22:34
@otan
Contributor

otan commented Oct 21, 2020

(seems like you need to rebase and adjust your PR to have a release note for this to go green!)

@angelapwen angelapwen force-pushed the denylist-rfc branch 3 times, most recently from 5ebf9a1 to b8567f2 on October 21, 2020 03:56
Release note: None

Co-Authored-By: Oliver Tan <[email protected]>

It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings cannot be changed.

Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own.
Contributor

time- and resource-intensive for developers to write their own

Clarify what "write their own" means? I believe this refers to developers modifying the application code to stop running certain queries which is not always possible.

Author

Yes, this is what I meant — I'll modify to clarify.


# Summary

We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location.
Contributor

DBAs would need file write permission to do this. What user/level of access do CC SREs have to CC clusters?

Author

I don't know the answer to this — perhaps an SRE reviewing here could answer? @joshimhoff or @JuanLeon1?

The denylist we define for this RFC and plan to implement is a global, non-configurable SQL query denylist in CockroachDB. A location for this denylist will be configured via a CLI flag, and if a denylist is uploaded to that location, any following queries matching any pattern on the denylist file will be denied. The supported file format for the denylist is YAML; each query pattern on the denylist file must be specified via regular expression conventions.

### Staging denylist
The staging denylist is intended to be used so that a user will be able to test whether the query patterns on their denylist deny the appropriate queries, without actually blocking them. A separate location for the staging denylist will be configured via a CLI flag. Otherwise, it should be formatted in the same way as the denylist.
Contributor

Open question about whether we need a staging denylist.

Assuming we do, a separate file seems more error-prone than marking queries as "staged" in a single denylist file.

Author

If I remember correctly, @JuanLeon1 mentioned in our call that having a separate file for the staging denylist would be preferable so that it could go through a separate pull request/review process for changes, for auditing purposes. Perhaps I misinterpreted and he meant that having a single file is preferable for efficiency?

## New named concepts

### Denylist
The denylist we define for this RFC and plan to implement is a global, non-configurable SQL query denylist in CockroachDB. A location for this denylist will be configured via a CLI flag, and if a denylist is uploaded to that location, any following queries matching any pattern on the denylist file will be denied. The supported file format for the denylist is YAML; each query pattern on the denylist file must be specified via regular expression conventions.
Contributor

"non-configurable" is confusing here, I assume you mean non-configurable through SQL

Author

Hm, yes you're right because evidently it can be configured through modifying the YAML file 😄 I'll think of better wording for this or remove the word altogether.

### User experience from a database administrator's point of view
A database administrator, anticipating that a denylist or staging denylist may be used in the future, uses a flag to configure where the denylist will live when it is needed. When they notice a cluster is unresponsive and isolate the root cause to a particular query (or several), they may optionally want to confirm that the regex query pattern matches the queries they want to deny.

To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs.
Contributor

If we're assuming an unresponsive cluster, the Admin UI and SQL console would be inaccessible? So you could only confirm staged queries with logs

Author

You're right, in that situation they'd have to just confirm via logs. I'll update to clarify.

## Detailed design

### Design overview
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist.
Contributor

How frequently do we check the denylist for updates?

Collaborator

i think it's mentioned later -- there's a file watcher that notifies when a file is modified

Collaborator

it's mentioned later -- we get notified whenever the file is modified
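
For illustration only, here is a minimal sketch of what such a file watcher could look like in Go. It assumes the fsnotify library, a hypothetical `loadDenylist` callback, and an arbitrary file path; it is not the RFC's actual implementation, and as noted later in this review, fsnotify has caveats with some Docker volume mounts.

```go
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// watchDenylist re-reads the denylist whenever the file at path is
// created, modified, or removed. The parent directory is watched so that
// creation and removal of the file itself are observed too. loadDenylist
// is a hypothetical callback that parses the YAML and swaps in the new
// pattern set (an empty set if the file is gone).
func watchDenylist(path string, loadDenylist func(path string)) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	path = filepath.Clean(path)
	if err := watcher.Add(filepath.Dir(path)); err != nil {
		return err
	}
	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			if filepath.Clean(ev.Name) == path {
				loadDenylist(path)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("denylist watcher error: %v", err)
		}
	}
}

func main() {
	// The path is illustrative; the RFC proposes making it a CLI flag.
	err := watchDenylist("/etc/cockroach/denylist.yaml", func(p string) {
		log.Printf("reloading denylist from %s", p)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```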

# Queries to be denied; in the future we may support denying other fields.
sql:
# Denies the experimental temp tables feature to be turned on.
- "SET experimental_enable_temp_tables=on"
Contributor

Does this include all cluster settings?

Author

I don't quite understand what you mean here. Do you mean that all cluster settings could be added to this denylist? I think that's right. They can always be removed.

### File-managed denylist implementation
We expect users to create their own YAML file with a `sql` field that lists an array of strings representing regex to match any SQL queries to be denied. This allows granular control of specific command types as well as database objects — a user should be able to denylist any command of a certain type, any command pertaining to a particular database object, or any additional parameters that may be affecting the cluster.

In the future, we could extend the functionality with additional fields such as, for example, `IP` to support denying queries made from specific IP addresses.
Contributor

@vy-ton vy-ton Oct 22, 2020

This list doesn't include arbitrary queries; it is all features.

@joshimhoff or @JuanLeon1 Is the ability to dynamically turn off specific features what CC SRE really needs? We wouldn't need a denylist to support arbitrary query patterns if we just need to prevent backups, CDC, temp tables etc?

### Metrics logging
When queries are denied, a level 2 event will be logged: `denylist match found: query %s denied, pattern matched %s`.

Several graphs will be added to the Admin UI, including:
Contributor

In the phase where we think a file-based approach is required to meet the use case, this wouldn't be needed


Example denylist: **denylist.yaml**
```
# Queries to be denied; in the future we may support denying other fields.
```
Contributor

What happens if a user enters an invalid entry?

Author

That's fair. I'd suggest that an invalid entry (specifically an invalid key, one that's not sql in this case) be surfaced as a warning/error to console and the logs so that the administrator knows that the denylist was not processed correctly. @otan thoughts here?
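
As a rough sketch of how this could work, the snippet below (in Go, assuming `gopkg.in/yaml.v2`; struct and function names are hypothetical and not part of the RFC) parses the proposed `sql` list, rejects unknown keys via strict unmarshalling, and reports invalid regex patterns:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"regexp"

	yaml "gopkg.in/yaml.v2"
)

// denylistFile mirrors the YAML layout proposed in the RFC: a single
// `sql` key holding regular-expression strings.
type denylistFile struct {
	SQL []string `yaml:"sql"`
}

// loadDenylist parses the YAML file and compiles each pattern. Unknown
// keys and invalid regexes are reported as errors so the administrator
// can see that the denylist was not processed correctly.
func loadDenylist(path string) ([]*regexp.Regexp, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var f denylistFile
	// UnmarshalStrict rejects any key other than `sql`.
	if err := yaml.UnmarshalStrict(data, &f); err != nil {
		return nil, fmt.Errorf("invalid denylist: %v", err)
	}
	patterns := make([]*regexp.Regexp, 0, len(f.SQL))
	for _, p := range f.SQL {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid pattern %q: %v", p, err)
		}
		patterns = append(patterns, re)
	}
	return patterns, nil
}

func main() {
	patterns, err := loadDenylist("denylist.yaml")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d denylist patterns\n", len(patterns))
}
```

With this kind of approach a malformed file fails loudly instead of silently denying nothing.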


It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings cannot be changed.

Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own.
Collaborator

this is kind of a nit, but perhaps a word other than retries would be helpful. that makes me think of a query being retried because it failed. but i think we are also just worried about applications running more instances of an entire class of "bad queries" that it executes in the course of handling regular requests


To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs.

The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI.
Collaborator

to clarify: the deny list won't affect any currently running queries? what about queries that are retried internally?


The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI.

The administrator may modify the file content as needed; and if the entire denylist is no longer available, they may remove it entirely. The changes will take effect immediately.
Collaborator

what does it mean for the entire denylist to no longer be available?

Author

I think this was a typo — I probably meant "if the entire denylist is no longer necessary" Thanks for the catch!

## Detailed design

### Design overview
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist.
Collaborator

i think it's mentioned later -- there's a file watcher that notifies when a file is modified


It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings cannot be changed.

Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own.
Collaborator

kind of a nit, but i'd use a word other than retry here. retries make me think of a query being retried because the previous time failed. but we also are more concerned with the application generating more of a certain type of query in the process of handling more user requests.

Author

@angelapwen angelapwen Oct 22, 2020

Yeah, that makes sense. I hadn't thought about how the word retry is overloaded in the context of application code vs. our database internals.


The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI.

The administrator may modify the file content as needed; and if the entire denylist is no longer available, they may remove it entirely. The changes will take effect immediately.
Collaborator

what does it mean for the entire denylist to no longer be available?

## Detailed design

### Design overview
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist.
Collaborator

it's mentioned later -- we get notified whenever the file is modified


To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs.

The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI.
Collaborator

so any currently running queries are not affected? what about queries that are retried internally?

Author

That's correct. I don't believe that queries that are retried internally, or queries that are in the middle of running, should be affected — only new incoming queries.


The denylist on its own does not seem to warrant the addition of an additional abstraction in the form of a SQL proxy layer to our database product and we consider it out of scope of this RFC. If such a layer were to be added in the future, the denylist functionality should be moved there.

## Unresolved questions
Collaborator

could we add a section about testing? would be nice to see what sort of automated tests we can do. and also maybe some manual test cases to verify it works at the end and is a good experience

Author

I'll look into this. Thanks for the suggestion 💖

Contributor

@awoods187 awoods187 left a comment



docs/RFCS/20201020_sql_query_denylist.md, line 10 at r1 (raw file):

# Summary

We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location.  

We don't grant SSH access (and therefore the ability to upload a YAML file) on CockroachCloud. As it is currently designed, SREs could use this but our CC customers could not. What happens if a customer wants to override an SRE decision but can't configure this YAML file? This is straying beyond database-as-a-service and impinging on application management.


docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location.  
# Motivation

I'd like to see a discussion of what exists in Postgres and other database peers. My initial googling hasn't found a good comparison and it concerns me to deviate as heavily in this case. If they don't allow for this behavior perhaps we need to further reconsider.


docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):

As blocking a query pattern is invasive and potentially dangerous, it is also important to add a staging denylist for administrators to test whether the query pattern accurately matches the queries they meant to deny or not. 

Having a query denylist will allow database administrators to mitigate these situations quickly and immediately without making any code changes. Additionally, this will give more flexibility and time for CockroachDB’s database and release engineers to work on database fixes relevant to the query. 

This feels very anti-developer in our thinking. I think we at least need some conversation in here about how operators and developers would interact. How would developers understand their query is blocked (I see this covered in part below but it feels incomplete)? Why can't developers continuously deploy a patch to the application that removes the problematic query? Or roll back to the previous version? It strikes me as dangerous to assume that operators (and particularly our SREs) know better than developers what the application's needs are. Why wouldn't we scale the cluster first and let developers/customers decide what the right action to take is (e.g., supporting the increased scale, rolling back, or patching the problem)?


docs/RFCS/20201020_sql_query_denylist.md, line 36 at r1 (raw file):

Previously, vy-ton (Vy Ton) wrote…

If we're assuming an unresponsive cluster, the Admin UI and SQL console would be inaccessible? So you could only confirm staged queries with logs

CC doesn't support log access for non-CRL labs employees so this also seems problematic. In addition, we would likely want to persist information in SQL/web console about what was denied and when, especially as we look toward persisting information in the future.


docs/RFCS/20201020_sql_query_denylist.md, line 124 at r1 (raw file):

Previously, vy-ton (Vy Ton) wrote…

In the phase where we think a file-based approach is required to meet the use case, this wouldn't be needed

you'd still likely want to know what was blocked historically when it came back


docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):

### Pattern matching with regular expressions
There are a couple of drawbacks to consider due to the use of regular expressions to represent queries to be denied in our YAML file. The first is that it is slightly less user-friendly than exact fingerprint matching, because certain special characters in our queries will need to be escaped: eg. `SELECT \(\(a\)\) FROM b`. We accept this drawback because it gives the user writing the query much more fine-grained control than fingerprint matching does — with fingerprint matching, all queries matching a fingerprint will be blocked and the user would not be able to block, for example, queries for a specific database object.

YAML is a huge drawback. We ran into this via zone configs and people really disliked its usage. In part you could argue because they were using SQL and we forced a switch. I think this presents a similar problem.


docs/RFCS/20201020_sql_query_denylist.md, line 169 at r1 (raw file):

### Using file-managed denylist vs. cluster settings/system table
A file-managed system for the denylist was chosen over cluster settings and system table for this first iteration of the denylist. This is because the most urgent use case for our users is a situation in which clusters are entirely unresponsive, and therefore updating a denylist via cluster settings or system table would not be feasible. The file-managed denylist will mitigate the most extreme circumstance and therefore is the first denylist management system we will design and implement.

I still think it would be more natural to roll back an application change that caused this problem and/or scale the DB resources.


docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file):

In the future, it is likely that we will add implementations of the denylist via cluster settings for broader use cases when the cluster is not unresponsive. At this point it would be possible to surface the functionality to the Web UI for an easier user experience.

### Denylist implementation in the database vs. SQL proxy

It strikes me as potentially not cloud-first to implement this now when SQL proxy is on the roadmap purely for on-prem users. Any work we do in SQL should not be motivated purely for on-prem users.

@otan
Contributor

otan commented Oct 22, 2020


docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

Previously, awoods187 (Andy Woods) wrote…

I'd like to see a discussion of what exists in Postgres and other database peers. My initial googling hasn't found a good comparison and it concerns me to deviate as heavily in this case. If they don't allow for this behavior perhaps we need to further reconsider.

You're right in that most databases do not allow this kind of configurability - this is we brought up that this should really be done on a proxy layer instead. Though it evades me why we think this should go on the DB for now besides making it easier for CC to have something in the meantime. @joshimhoff is probably a good person to answer here.

@otan
Contributor

otan commented Oct 22, 2020


docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

this is we brought up that this should really be done on a proxy layer instead
oops, *this is why we brought up whether this should be done on a proxy layer instead

@otan
Contributor

otan commented Oct 22, 2020


docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):

Previously, awoods187 (Andy Woods) wrote…

YAML is a huge drawback. We ran into this via zone configs and people really disliked its usage. In part you could argue because they were using SQL and we forced a switch. I think this presents a similar problem.

Is there something that you would suggest as a replacement? Configuration of these items (as currently written) has to be file-based due to the nature of the blocking. If we did this as a cluster setting / SQL table, I can see that being different.

@otan
Contributor

otan commented Oct 22, 2020


docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):

Why can't developers continuously deploy a patch to the application that removes the problematic query?

I can imagine a situation where the DB operators have no control over applications being pushed. It may be quicker to deny something and let the application rollback happen more gradually.

Why wouldn't we scale the cluster first

I believe SREs have had the problem where, no matter the size of the cluster, a certain command may always break it.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

Why can't developers continuously deploy a patch to the application that removes the problematic query?

I can imagine a situation where the DB operators have no control over applications being pushed. It may be quicker to deny something and let the application rollback happen more gradually.

Why wouldn't we scale the cluster first

I believe SREs have had the problem where, no matter the size of the cluster, a certain command may always break it.

I agree that the notification to developers that their query is being rejected needs to be solid and thought out, perhaps elaborated on a bit more.

I don't understand the rest. Let's say every developer carries a pager and is able to diagnose and fix a problem and push a solution to their application as fast as an on-call SRE can add the query to the blacklist. In the first case, the time to repair sees the whole database and all the other apps down; in the latter case, only the one app is down. For a production system where downtime is money, the blacklist seems like profit. Even assuming all devs are always on call, etc.

I don't think this is more than slightly "anti-developer". The developer can wake up at 3 am if they want, fix the problem, and remove the blacklist as they push the solution.

Further, a query of death often decimates a cluster of any size, because such queries are usually retried until the service stops responding. I've got many fireside stories to tell that all sound like that.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 60 at r1 (raw file):

Previously, vy-ton (Vy Ton) wrote…

This list doesn't include arbitrary queries; it is all features.

@joshimhoff or @JuanLeon1 Is the ability to dynamically turn off specific features what CC SRE really needs? We wouldn't need a denylist to support arbitrary query patterns if we just need to prevent backups, CDC, temp tables etc?

Well, turning off specific features would also be useful, but that is not what we are trying to get here. It is queries that kill the system, which people would not predict would kill the system, that we need to block when the system keeps dying on startup. Some of these queries may come from features like backups, CDC, etc., but many of them come from normal apps that got unlucky or are too aggressive or both.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 169 at r1 (raw file):

Previously, awoods187 (Andy Woods) wrote…

I still think it would be more natural to roll back an application change that caused this problem and/or scale the DB resources.

It is not always an app change; it can be that a table size went over a threshold, or that the DB version changed along with, e.g., the optimizer. Also, the operator may not have the ability to roll back the app change, which may even come from a client in a different company in North Korea. Regarding scale, apps often retry, and a bad query can consume infinite resources unless throttled.

More natural, perhaps, to someone, but better able to keep a service up, no.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file):

Previously, angelapwen (Angela P Wen) wrote…

If I remember correctly, @JuanLeon1 mentioned in our call that having a separate file for the staging denylist would be preferable so that it could go through a separate pull request/review process for changes, for auditing purposes. Perhaps I misinterpreted and he meant that having a single file is preferable for efficiency?

I don't feel strongly, but I think you want some people to have access to staging blocklists whom you don't want to have access to production ones. I can see testing a list and then needing another approval from someone else to actually make it do its thing.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):

Previously, JuanLeon1 wrote…

I agree that the notification to developers that their query is being rejected needs to be solid and thought out, perhaps elaborated on a bit more.

I don't understand the rest. Let's say every developer carries a pager and is able to diagnose and fix a problem and push a solution to their application as fast as an on-call SRE can add the query to the blacklist. In the first case, the time to repair sees the whole database and all the other apps down; in the latter case, only the one app is down. For a production system where downtime is money, the blacklist seems like profit. Even assuming all devs are always on call, etc.

I don't think this is more than slightly "anti-developer". The developer can wake up at 3 am if they want, fix the problem, and remove the blacklist as they push the solution.

Further, a query of death often decimates a cluster of any size, because such queries are usually retried until the service stops responding. I've got many fireside stories to tell that all sound like that.

My first paragraph was wrong; in both cases the whole DB would be down. So if all devs carry pagers and can diagnose/push fixes as fast as an on-call SRE can apply a blocklist, there is no benefit to the blocklist other than a bunch of unhappy, unbelievably good devs.

Author

@angelapwen angelapwen left a comment

Thank you all for the great reviews 😸

Raphael, I would tend to agree with your recommendation that we should focus on the 1st use case you mentioned above, as our goal here was to create a "non-configurable" (by the application developer), global denylist. I hadn't realized that the two (or three, per Josh!) use cases were not so compatible until I looked through these reviews, but given that they aren't, I agree that the first one is the one we should solve first.

It sounds like creating a global denylist like this will work as needed for the multi-tenant CC (serverless) case and, with a few modifications around communication, can be configured for the single-tenant CC (dedicated) case.

Is it possible to present this feature as an option for on-prem customers with their full knowledge that it does not come with end-user configurability and the option to "opt out"? If this is not acceptable, then we can exclude on-prem from our intended customers of this first version of the denylist implementation.

PS — the point about benchmarking is a good one @knz. I'll discuss further details with Oliver and update progress or implementation details via this PR/RFC.



docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

We certainly need some way to test out additions to the denylist. This tool is too dangerous not to have that capability.

I think I actually prefer a single file. The benefit is that the query you are testing in the staging denylist is def the same as the one you are about to actually block in the prod one. Since same file. If two files, maybe you copy & paste wrong, and then your test via the staging denylist is a bad test.

Wonder what Juan thinks about ^^.

That's fair. I worry that there could be danger in using the same file as well (accidentally removing the 'staging' key before it is ready or something like that). Open to either.


docs/RFCS/20201020_sql_query_denylist.md, line 58 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

Should we allow adding text to include in the error message? This way, SRE can communicate with the user about why the query is being blocked?

That seems like a good idea to me. There can be default error message text (what is displayed now) and an optional field that will allow a custom error message.


docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):

Previously, knz (kena) wrote…

To clarify my last comment on this point: I am not suggesting to turn this thing into a cluster setting. We can still use files. But the reloading of the file should not be automatic and instead there should be a mechanism to trigger it.

For example, we trigger rotation of TLS certificates using a unix signal, and we trigger other types of file reloads using HTTP/RPC APIs.

Could you clarify which comment you meant about notification? I am not understanding why fsnotify does not work well when files in a container mount are updated.

I understand the point about invalid intermediate states — most of these result in a different file extension that could be ignored by the watcher, right?

I didn't know that we already use a pattern of triggering reloads using HTTP/RPC APIs. Do you know off the top of your head where I can dig into these a bit more?


docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

Interesting!

That is interesting — at the current stage we are only checking against the canonical SQL inside CockroachDB after transformations. This is not always intuitive to the operator configuring the YAML file — though if CC SREs are the primary customer, they would perhaps be able to validate with database SQL engineers sooner than on-prem operators.

The latter case mentioned could be checked for by using the ^ and $ anchors, though this may require that the operator write pretty fine-grained, complex regular expressions for the file.
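
A small illustration of the point about anchors and escaping, using Go's standard `regexp` package (the query strings here are examples only, not patterns from the RFC):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Unanchored: matches any query that merely contains the fragment.
	loose := regexp.MustCompile(`SELECT \(\(a\)\) FROM b`)
	// Anchored with ^ and $: matches only the exact statement.
	exact := regexp.MustCompile(`^SELECT \(\(a\)\) FROM b$`)

	q := "SELECT ((a)) FROM b WHERE c > 1"
	fmt.Println(loose.MatchString(q)) // true
	fmt.Println(exact.MatchString(q)) // false

	// regexp.QuoteMeta produces the escaped form mentioned in the RFC,
	// easing the "special characters must be escaped" drawback.
	fmt.Println(regexp.QuoteMeta("SELECT ((a)) FROM b"))
	// Output: SELECT \(\(a\)\) FROM b
}
```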


docs/RFCS/20201020_sql_query_denylist.md, line 124 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

Why would these graphs not be needed with a file-based approach? These graphs seem very useful to me.

I think @vy-ton meant that they wouldn't be accessible in case the cluster is fully down, but yes when they are back up they would be useful to check later on.


docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

This is a thorny problem. No one likes really any of the available languages for configuring applications. There is not really a "good" one.

YAML has comments so it is better than JSON for this use-case IMHO.

What we design ourselves very well may be worse than YAML. It is a common thing to do to invent a new (special-case) config language bc/ the existing ones are "bad"... then realize oh my ours is a mess also!

I think zone configs are logically part of the schema of an application. So should be SQL. That is not the case here.

I'd also like to avoid writing a new config language for this case if at all possible, for all the reasons you mentioned 😄


docs/RFCS/20201020_sql_query_denylist.md, line 181 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think this an important Q! I will be interested in following along & seeing how DB engineers answer Qs like this.

Thanks Josh! Will try to keep you updated as these benchmarks move forward. It looks like Raphael also wanted the benchmarking process to be included as part of the RFC.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

Previously, knz (kena) wrote…

FWIW one of our major customers ever since 2016 has been pushing for this feature, because they use an equivalent feature in Oracle.

We need to understand what Oracle provides and how they achieve it. That feature is loved by DBAs.

There are two things that I remember Robert (from sales eng) has been very adamant about:

  • the filtering needs to be done per application (not globally): i.e. a query must be able to succeed for one value of application_name even when it is blocked for another
  • the filtering needs to be disabled for queries sent by administrator users (I'd say, for users with a special privilege - we'd need to define a new privilege maybe, ask Solon)

More interesting than what Postgres does is, I think, what managed SQL DBs do; e.g., Google Cloud SQL backed by Postgres.
An on-prem, customer-controlled and -endured database would not need this. A DB that one party has an SLA to keep up, without controlling the workload from remote clients, needs something like it.
E.g., Spanner was built as an internal SaaS and I am pretty sure Spanner SREs have this.
We can build it in the proxy, sure, but any on-prem customer of any size will want it. So we should be packaging the proxy with CRDB, which introduces some complexity.

I think "cloud-first" means managed-first, meaning there is a split of agency between the DB service operators and the app operators. Would we insist that the former have no control over the workload presented to them, and thereby relinquish control over their ability to maintain the service?

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file):

Previously, angelapwen (Angela P Wen) wrote…

That's fair. I worry that there could be danger in using the same file as well (accidentally removing the 'staging' key before it is ready or something like that). Open to either.

I can see Josh's point. The downside is that presumably production blocks require one of a few people to approve, which means that even test and staging blocks will be starved for attention and approval.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file):

Previously, angelapwen (Angela P Wen) wrote…

That is interesting — at the current stage we are only checking against the canonical SQL inside CockroachDB after transformations. This is not always intuitive to the operator configuring the YAML file — though if CC SREs are the primary customer, they would perhaps be able to validate with database SQL engineers sooner than on-prem operators.

The latter case mentioned could be checked for by using the ^ and $ anchors, though this may require that the operator write pretty fine-grained, complex regular expressions for the file.

Is the canonical SQL what shows up in the table of queries? In logs? That would be what operators would use to block.
CC SREs are not the primary customer -- all SREs in customer orgs will also want to use this if it is available.

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):

Previously, angelapwen (Angela P Wen) wrote…

I'd also like to avoid writing a new config language for this case if at all possible, for all the reasons you mentioned 😄

+1 on YACL (yet-another-configuration-language). My rule of thumb for configs: 1) JSON if a human never needs to read or write it, 2) YAML if the config is simple enough -- as is the case here, and 3) Jsonnet if you need inheritance, modules, and still want it human-readable.

Collaborator

@rafiss rafiss left a comment

moved comment to thread

@JuanLeon1
Contributor


docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I do think on-prem folks want this feature. I would if I was running CRDB on-prem. Raphael points out customers have asked for it.

OTOH, if putting it in the SQL proxy gets CC SRE what we want, I'd be okay with that too. The main thing we'll have to figure out is whether we can drop the SQL proxy into dedicated clusters; if we can't do that, then putting it there doesn't help SRE so much.

And on-prem customers will want this and would then have to get the proxy, so maybe need to ship proxy by default, which is not very different from putting the feature in CRDB.

Collaborator

@rafiss rafiss left a comment



docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file):

Previously, JuanLeon1 wrote…

And on-prem customers will want this and would then have to get the proxy, so maybe need to ship proxy by default, which is not very different from putting the feature in CRDB.

Just my 2cents -- if we are explicitly deciding to address the CC SRE use-case, and not address the self-hosted DBA use-case, then I view that as an argument against building this feature inside the cockroach server. Otherwise, we will be releasing a feature to all our self-hosted users that was not intended for them, and we will bear the maintenance and support costs as they ask us to change the feature to their liking.

(moved my comment from the top level comments to this thread)

Contributor

@knz knz left a comment



docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

Previously, JuanLeon1 wrote…

More interesting than what Postgres does is, I think, what managed SQL DBs do; e.g., Google Cloud SQL backed by Postgres.
An on-prem, customer-controlled and -endured database would not need this. A DB that one party has an SLA to keep up, without controlling the workload from remote clients, needs something like it.
E.g., Spanner was built as an internal SaaS and I am pretty sure Spanner SREs have this.
We can build it in the proxy, sure, but any on-prem customer of any size will want it. So we should be packaging the proxy with CRDB, which introduces some complexity.

I think "cloud-first" means managed-first, meaning there is a split of agency between the DB service operators and the app operators. Would we insist that the former have no control over the workload presented to them, and thereby relinquish control over their ability to maintain the service?

Juan my experience of other teams building hosted databases is that they implement two separate features in their infrastructure:

  • there is a global filter that applies to every hosted service equally, which can be deployed "quickly" by SREs and is not controllable by customers
  • there is a per-application / service filter entirely controllable by customers (with limited control for SREs)

I do not think that we wish Angela to work on both at this time. We should really focus.


docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):
fsnotify relies on the event notification from the Linux kernel working properly, which it does not for certain ways Docker runs volume mounts.
For example, an fsnotify-based solution would certainly not work when running Docker on top of macOS.

most of these result in a different file extension that could be ignored by the watcher, right?

No that is not my experience. Text editors overwrite files in-place, especially when the person editing tends to press Ctrl+S regularly as they write, which is customary.

Do you know off the top of your head where I can dig into these a bit more?

cli/start.go for the signal handling

server/admin.go for the RPCs

that said I do not believe you need to do a deep analysis at this point of the RFC. You can simply explain "there's going to be a trigger mechanism".
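
For readers unfamiliar with that pattern, a sketch of a signal-based trigger in Go might look like the following; the path and the `reload` callback are hypothetical, and this is not CockroachDB's actual signal-handling code:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// reloadOnSIGHUP asks the process to re-read the denylist when it
// receives SIGHUP, instead of watching the file for changes. reload is a
// hypothetical callback that re-parses the file.
func reloadOnSIGHUP(path string, reload func(path string) error) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			if err := reload(path); err != nil {
				log.Printf("denylist reload failed: %v", err)
				continue
			}
			log.Printf("denylist reloaded from %s", path)
		}
	}()
}

func main() {
	reloadOnSIGHUP("/etc/cockroach/denylist.yaml", func(p string) error {
		// Re-parse the file here; a no-op in this sketch.
		return nil
	})
	select {} // block so the example keeps running
}
```

The operator would then edit the file and send SIGHUP to the node, which keeps the reload explicit and auditable compared to an automatic watcher.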


docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file):

the canonical SQL inside CockroachDB after transformations. This is not always intuitive to the operator configuring the YAML file — though if CC SREs are the primary customer, they would perhaps be able to validate with database SQL engineers sooner than on-prem operators.

I think this paragraph is key to determine the next course of action. The story you're sketching in this explanation is the following:

  1. malicious user sends "complicated" SQL to crdb
  2. crdb parses and normalizes the SQL to something simpler
  3. however during execution the node crashes / becomes starved for resources
  4. the SRE gets notified of the situation and needs to "do something about it"

However, by the time step 4 happens, crdb is already down and inoperative. Therefore, at that point the SRE can no longer ask the crdb node to show what the "canonical" SQL looks like. It is too late. The only thing the SRE may see is the "raw" SQL that is incoming on client connections, presumably as they are trying to disrupt more nodes.

What I am pointing out is that a SRE that needs to "take action" during an attack must be able to do so with the least dependencies on the system under attack. It is unreasonable to expect them to 1) log in to a crdb node 2) do query parsing and planning 3) show the resulting SQL just so they can figure out which regexp to apply.

This clearly and directly (to my eyes) redirects the design towards a pattern match on the raw SQL before it "enters" crdb. Either in the SQL proxy or at the level of "pgwire".

This also directly reminds me of another thing we need to deal with: if there are SQL strings that cause problems in pgwire or the SQL executor before they get planned, then it's possible for them to "crash" or disrupt a node before the code inside the "execution" function is reached, where you were planning to introduce the pattern match.

I am also reminded that if a SQL string is extremely long, or if it contains a lot of placeholders, then pgwire is going to consume a lot of resources just reading the SQL. So simply waiting for the SQL to be entirely received by a crdb node makes it already too late for the check.

So we can't do this deep inside the SQL layer, after query normalization. And even pgwire may be too late.

So overall I think this makes it clear that pattern matching needs to apply on the "raw" SQL, and as early as possible, and if possible before it "enters" the crdb node. So that tends to point to the SQL proxy, and this way my reasoning gets me to the same place Rafi also reached in his last comment.

Author

@angelapwen angelapwen left a comment

Thanks all for your input. Will regroup with @vy-ton and @otan today to discuss the great points brought up and consider the SQL proxy mechanism in more depth



docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):

Previously, knz (kena) wrote…

Juan my experience of other teams building hosted databases is that they implement two separate features in their infrastructure:

  • there is a global filter that applies to every hosted service equally, which can be deployed "quickly" by SREs and is not controllable by customers
  • there is a per-application / service filter entirely controllable by customers (with limited control for SREs)

I do not think that we wish Angela to work on both at this time. We should really focus.

I agree that we should be focusing on the former: the global filter that is deployed by SREs and does not have application developer configurability.


docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):

Previously, knz (kena) wrote…

fsnotify relies on the event notification from the Linux kernel working properly, which it does not for certain ways Docker runs volume mounts.
For example, an fsnotify-based solution would certainly not work when running Docker on top of macOS.

most of these result in a different file extension that could be ignored by the watcher, right?

No that is not my experience. Text editors overwrite files in-place, especially when the person editing tends to press Ctrl+S regularly as they write, which is customary.

Do you know off the top of your head where I can dig into these a bit more?

cli/start.go for the signal handling

server/admin.go for the RPCs

that said I do not believe you need to do a deep analysis at this point of the RFC. You can simply explain "there's going to be a trigger mechanism".

Thank you, understood! 🙏

@angelapwen
Author

Hi everyone 👋 thank you for all your input on the RFC thus far. We've decided to work on a small prototype solely for the SRE use case to begin with, and once the SREs have played around with it a bit we can continue discussions on this RFC with more information.

To that end we've moved conversation to #tmp-sql-denylist on Slack, where you're welcome to join us. We've had some interesting suggestions by @jordanlewis and @andy-kimball in the channel to add to the mix that may inform some larger changes to the RFC and prototype.

@angelapwen
Author

Following up for interested parties. Based on the conversation in the Slack channel, we decided to move forward with a simpler feature flag implementation for now to address CC SRE's use cases more quickly.

Thank you for your feedback again! We will leave this RFC as a draft for if and when the denylist is revisited.

@knz
Contributor

knz commented Nov 11, 2020

Please translate the document you linked into a new RFC document in the repository. We need a public record of this conversation.

@angelapwen
Author

Please translate the document you linked into a new RFC document in the repository. We need a public record of this conversation.

Sure, I am happy to do this. I was under the impression that the feature flag implementation was not advanced enough to require its own RFC, but I understand the need for a public record somewhere outside of a Google Doc hosted in an intern's Google Drive. I will write a separate RFC document and link to this one.

craig bot pushed a commit that referenced this pull request Nov 12, 2020
55325: ui: Transactions link between Sessions and Statements r=dhartunian a=elkmaster

Resolves: #55131, #56244

Release note (admin ui change): Link to the Transactions page is now shown
between the Sessions and Statements links in the left hand navigation. This more
clearly reflects the hierarchy between the 3 concepts.

56346: testcluster: minor logging improvements r=andreimatei a=andreimatei

Log when TestCluster quiescing starts, and add a node log tag to each
node's quiescing ctx so messages from different nodes can be
disambiguated.

Release note: None

56437: cli, ui: dismiss release notes signup banner per environment variable r=knz,dhartunian a=nkodali

Previously, the signup banner could only be dismissed manually.
For internal testing purposes, this banner is unnecessary. This
change provides a way to dismiss the signup banner upon start of
a cluster via the cli by setting the environment variable
COCKROACH_UI_RELEASE_NOTES_SIGNUP_DISMISSED=true.

Resolves #46998

Release note: none

56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen

Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands.

See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. 

Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'`

— Commit message below — 

Adds a cluster setting to toggle a feature flag for the BACKUP and
RESTORE commands off and on; as well as a broad category for
Bulk IO commands. Currently disabling the cluster setting for Bulk
IO will only disable BACKUP and RESTORE jobs, but other types may
be included in this category in the future.

The feature is being introduced to address a Cockroach Cloud SRE
use case: needing to disable certain categories of features in
case of cluster failure.

Release note (enterprise change): Adds cluster settings to enable/
disable the BACKUP and RESTORE commands. If a user attempts to use
these features while they are disabled, an error indicating that
the database administrator has disabled the feature is surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.bulkio.backup.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.backup.enabled = TRUE;
SET CLUSTER SETTING feature.bulkio.restore.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.restore.enabled = TRUE;
SET CLUSTER SETTING feature.bulkio.enabled = FALSE;
SET CLUSTER SETTING feature.bulkio.enabled = TRUE;

56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco

Previously, when using OSS builds (created with `make buildoss`), when
you loaded the DB Console in your browser, you'd get "Page Not Found".
The route for the overview page was missing the leading '/'.  This bug
appears to have been introduced in
722c932.

Release note (admin ui change): This fixes a bug where users of
the OSS builds of CockroachDB would see "Page Not Found" when loading
the Console.

56600: roachpb: remove SetInner in favor of MustSetInner r=nvanbenschoten a=tbg

As of a recent commit, `ErrorDetail.SetInner` became unused, and
we can switch to a `MustSetInner` pattern for `ErrorDetail`. Since
the codegen involved is shared with {Request,Response}Union, those
lose the `SetInner` setter as well; we were always asserting on
the returned bool there anyway so this isn't changing anything.

Release note: None



Co-authored-by: Vlad Los <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Namrata Kodali <[email protected]>
Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Joshua M. Clulow <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
craig bot pushed a commit that referenced this pull request Nov 13, 2020
56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen

Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands.

See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. 

Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'`

— Commit message below — 

Adds a cluster setting to toggle a feature flag for the BACKUP and
RESTORE commands off and on. The feature is being introduced to
address a Cockroach Cloud SRE use case: needing to disable certain
categories of features in case of cluster failure.

Release note (enterprise change): Adds cluster settings to enable/
disable the BACKUP and RESTORE commands. If a user attempts to use
these features while they are disabled, an error indicating that
the database administrator has disabled the feature is surfaced.

Example usage for the database administrator:
SET CLUSTER SETTING feature.backup.enabled = FALSE;
SET CLUSTER SETTING feature.backup.enabled = TRUE;
SET CLUSTER SETTING feature.restore.enabled = FALSE;
SET CLUSTER SETTING feature.restore.enabled = TRUE;

56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco

Previously, when using OSS builds (created with `make buildoss`), when
you loaded the DB Console in your browser, you'd get "Page Not Found".
The route for the overview page was missing the leading '/'.  This bug
appears to have been introduced in
722c932.

Release note (admin ui change): This fixes a bug where users of
the OSS builds of CockroachDB would see "Page Not Found" when loading
the Console.

56631: kv: remove assertion around QuotaPool and Raft window size config r=nvanbenschoten a=nvanbenschoten

This check was well intentioned when added, but seems misguided
in retrospect. There are valid reasons why someone might want to
configure limits to various dimensions of the Raft window size
in ways that, when combined, exceeds the total window size limit
(i.e. the quota pool size).

Co-authored-by: angelapwen <[email protected]>
Co-authored-by: Joshua M. Clulow <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@angelapwen
Copy link
Author

angelapwen commented Dec 7, 2020

Hi everyone 👋 as my internship winds down and I prepare to present on Thursday, I am providing an update here via comment on the status of the project and our design decisions thus far. Since the current implementation was much simpler and did not require many comments, I am writing the relevant notes down here rather than in another separate RFC.

Current Implementation — Feature Flags

Primary Differences

Instead of a denylist, and based on recent conversations in #tmp-sql-denylist, we will move forward with an MVP of a feature flag-based way to block SQL queries. It is estimated that this mechanism will block >90% of the queries that SREs have needed to block in the past, so a finer-grained regex denylist is not indicated at this time. Additionally, it seems that reaching engineering consensus on a regex denylist design would take longer, and this implementation allows us to more quickly address the most pressing issue at hand. The feature flag implementation also means we no longer need a ‘staging’ version, as there should be no ambiguity about whether a feature flag correctly matches the feature to be turned off (as compared to regular expression matching).

Another difference from the previous proposal is a separation of concerns between what the SREs handle on the Cockroach Cloud side and what the database engineers handle in CRDB. Originally we proposed a file-based mechanism so that SREs could apply a denylist while a cluster is unresponsive. The consensus on #tmp-sql-denylist, however, is that a better design (proposed by @andy-kimball) is for SREs to deny access for the particular user triggering the unresponsiveness, apply the query denial on the SQL side, and then restore access for that user. This decouples the database work from the CC SRE work and allows us to move forward using cluster settings for the implementation.

Storage Mechanism — Cluster Settings

We use cluster settings to store the feature flags we want to toggle. Every time the relevant cluster setting is set or changed, each node parses the new configuration and caches which features to block. Once the new configuration is cached, the feature flag is checked during the planning phase of command execution.

Our MVP includes boolean flags for each feature or feature category. So far we have: feature.backup.enabled, feature.restore.enabled, feature.import.enabled, feature.export.enabled, feature.changefeed.enabled, feature.stats.enabled, and feature.schema_change.enabled.

A sample PR for basic implementation on feature.stats.enabled is at #57076.
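To make the cache-and-check pattern described above concrete, here is a minimal, self-contained Go sketch. It is not CockroachDB's actual settings code: the names (featureFlagCache, checkEnabled, the update callback shape) are illustrative assumptions, and only the feature.stats.enabled key is taken from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// featureFlagCache mimics a per-node cache of feature flags that is
// refreshed whenever the corresponding cluster settings change.
type featureFlagCache struct {
	mu    sync.RWMutex
	flags map[string]bool // e.g. "feature.stats.enabled" -> true/false
}

// update stands in for the callback invoked when a cluster setting
// changes; every node re-parses the new values and stores them.
func (c *featureFlagCache) update(newFlags map[string]bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.flags = newFlags
}

// checkEnabled stands in for the hook called during statement planning:
// if the feature's flag is false, the statement is denied with an error
// telling the user that the administrator disabled the feature.
func (c *featureFlagCache) checkEnabled(feature string) error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if enabled, ok := c.flags[feature]; ok && !enabled {
		return errors.New(feature + " was disabled by the database administrator")
	}
	return nil
}

func main() {
	cache := &featureFlagCache{flags: map[string]bool{}}

	// Simulate: SET CLUSTER SETTING feature.stats.enabled = FALSE;
	cache.update(map[string]bool{"feature.stats.enabled": false})

	// Planning a statistics statement would consult the cache.
	if err := cache.checkEnabled("feature.stats.enabled"); err != nil {
		fmt.Println("statement rejected:", err)
	}
}
```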

Metrics and Telemetry

A telemetry counter, sql.feature_flag.denied, is incremented each time a statement is denied due to a feature flag. Additionally, a metric counter, sql.feature_flag_denial, is incremented in the same situation and will be displayed in a timeseries graph on the DB Console, in the SQL dashboard.
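As a rough sketch of where these counters hook in: both are bumped from the same code path that rejects the statement, so the counts always move together with the user-visible error. The helper and variable below are illustrative only; the real counters are registered through CockroachDB's metric and telemetry infrastructure rather than a bare integer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// featureFlagDenials stands in for the sql.feature_flag_denial metric.
var featureFlagDenials int64

// recordFeatureFlagDenial is a hypothetical helper called right before
// the "feature disabled" error is returned to the client.
func recordFeatureFlagDenial() {
	atomic.AddInt64(&featureFlagDenials, 1)
	// The telemetry counter sql.feature_flag.denied would be incremented here too.
}

func main() {
	recordFeatureFlagDenial()
	fmt.Println("denials so far:", atomic.LoadInt64(&featureFlagDenials))
}
```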

Future Extensions

Adding hierarchy to features

This option allows entire categories of features, such as schema changes, to be set and unset together, while maintaining the ability to set and unset more granular features as well. A possible resolution rule is sketched below.
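One possible shape for this extension, assuming hierarchical setting names like the feature.bulkio.enabled category that appeared in the earlier BACKUP/RESTORE commit message: a feature is enabled only if neither its own flag nor any enclosing category flag is explicitly disabled. The names and the precedence rule here are illustrative, not a committed design.

```go
package main

import (
	"fmt"
	"strings"
)

// flags maps fully qualified feature names to their values.
var flags = map[string]bool{
	"feature.bulkio.enabled":        false, // whole category switched off
	"feature.bulkio.backup.enabled": true,  // more specific flag still true
}

// isEnabled walks from the most specific flag up to its enclosing
// categories; any explicitly disabled level disables the feature.
func isEnabled(feature string) bool {
	parts := strings.Split(strings.TrimSuffix(feature, ".enabled"), ".")
	for i := len(parts); i > 1; i-- {
		key := strings.Join(parts[:i], ".") + ".enabled"
		if enabled, ok := flags[key]; ok && !enabled {
			return false
		}
	}
	return true
}

func main() {
	// BACKUP is denied because the bulkio category is disabled,
	// even though its own flag is still true.
	fmt.Println(isEnabled("feature.bulkio.backup.enabled")) // false
}
```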

Syntactic Sugar

This extension would wrap the same functionality into a cleaner, more intuitive CLI command so that operators would not need to learn the syntax for setting cluster settings.
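No such command exists today; one hypothetical shape it could take is a thin translator from a friendly command (say, a future `cockroach feature disable backup`) to the underlying SET CLUSTER SETTING statement. The command name and helper below are invented for illustration.

```go
package main

import "fmt"

// buildStatement translates a hypothetical "feature disable <name>"
// style command into the cluster setting statement it would run, so
// operators never have to remember the SET CLUSTER SETTING syntax.
func buildStatement(feature string, enable bool) string {
	value := "FALSE"
	if enable {
		value = "TRUE"
	}
	return fmt.Sprintf("SET CLUSTER SETTING feature.%s.enabled = %s;", feature, value)
}

func main() {
	fmt.Println(buildStatement("backup", false))
	// Output: SET CLUSTER SETTING feature.backup.enabled = FALSE;
}
```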

Future Considerations

We consider many features out of scope for this MVP, but we do want to consider how extending the MVP to support those features would affect our design. These include:

  • How on-prem users would interact with the feature flags
  • How CC serverless (where only the system tenant can set a cluster setting) will be able to use this functionality on a per-tenant basis. For the purpose of this use case, only the file cluster setting or system table approach would be extensible, as there would need to be extra configurability around tenant-by-tenant settings/table entries.

I hope this is useful for any future readers who may be interested in expanding upon the feature flags or moving forward with the denylist! ✨

@joshimhoff
Copy link
Collaborator

Thank you for the writeup (& the feature flags)!!

@otan
Copy link
Contributor

otan commented Jan 14, 2021

do you mind replacing the RFC with the write up, let's merge this (so it's off my stuff to review list :P )

@knz
Copy link
Contributor

knz commented Jan 14, 2021

I asked angela to make the rfc. If you cannot review it then I'll do it.

@otan
Copy link
Contributor

otan commented Jan 14, 2021

Ah the bit in the comment just needs to be translated into the markdown and we should be good.

@otan
Copy link
Contributor

otan commented Jan 14, 2021

(no rush tho!)

@angelapwen
Copy link
Author

Oh, sorry I think I had misunderstood. I had thought we were going to leave this RFC PR 'open' until/if a denylist or file-mounted path were to be revisited.

To clarify, should I replace the original markdown file on the denylist with the path we ended up taking (summarized in the last comment) and merge it? @otan @knz

@knz
Copy link
Contributor

knz commented Jan 14, 2021

There are two possible situations:

  • if the description of the work that was done is a natural / small deviation from the main text of the RFC today, then I recommend updating the RFC with the description of the work done, then merging that.

  • otherwise (large difference) you'd proceed as follows:

    1. edit the RFC that's already there with a small section that describes why that specific proposal was not realized
    2. mark the RFC (in the status field) as "rejected" or "postponed"
    3. add a new RFC doc with your description of the work done (or, at your leisure, a tech note in the tech note directory). If you go for an RFC, you would then mark it as "completed". You can add this 2nd document inside this PR.
    4. if at all possible, cross-link the two
    5. merge this PR.

@angelapwen
Copy link
Author

I see! I will do the second option as I believe they are quite fundamentally different designs and would benefit from having two separate docs.

@otan otan removed their request for review February 12, 2021 03:14
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Dec 1, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


angelapwen seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
