-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: create SQL query denylist RFC #55778
base: master
Are you sure you want to change the base?
Conversation
5df9148
to
5c69867
Compare
f4d7de1
to
32749dc
Compare
(seems like you need to rebase and adjust your PR to have a release note for this to go green!) |
5ebf9a1
to
b8567f2
Compare
Release note: None Co-Authored-By: Oliver Tan <[email protected]>
b8567f2
to
b383614
Compare
|
||
It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings will not be able to be set. | ||
|
||
Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time- and resource-intensive for developers to write their own
Clarify what "write their own" means? I believe this refers to developers modifying the application code to stop running certain queries which is not always possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what I meant — I'll modify to clarify.
|
||
# Summary | ||
|
||
We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DBAs would need file write permission to do this. What user/level of access do CC SREs have to CC clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the answer to this — perhaps an SRE reviewing here could answer? @joshimhoff or @JuanLeon1?
The denylist we define for this RFC and plan to implement is a global, non-configurable SQL query denylist in CockroachDB. A location for this denylist will be configured via a CLI flag, and if a denylist is uploaded to that location, any following queries matching any pattern on the denylist file will be denied. The supported file format for the denylist is YAML; each query pattern on the denylist file must be specified via regular expression conventions. | ||
|
||
### Staging denylist | ||
The staging denylist is intended to be used so that a user will be able to test whether the query patterns on their denylist deny the appropriate queries, without actually blocking them. A separate location for the staging denylist will be configured via a CLI flag. Otherwise, it should be formatted in the same way as the denylist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question about whether we need a staging denylist.
Assuming we do, a separate file seems more error-prone instead of marking queries as "staged" in a single denylist fiile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, @JuanLeon1 mentioned in our call that having a separate file for the staging denylist would be preferable so that it could go through a separate pull request/review process for changes, for auditing purposes. Perhaps I misinterpreted and he meant that having a single file is preferable for efficiency?
## New named concepts | ||
|
||
### Denylist | ||
The denylist we define for this RFC and plan to implement is a global, non-configurable SQL query denylist in CockroachDB. A location for this denylist will be configured via a CLI flag, and if a denylist is uploaded to that location, any following queries matching any pattern on the denylist file will be denied. The supported file format for the denylist is YAML; each query pattern on the denylist file must be specified via regular expression conventions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"non-configurable" is confusing here, I assume you mean non-configurable through SQL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yes you're right because evidently it can be configured through modifying the YAML file 😄 I'll think of better wording for this or remove the word altogether.
### User experience from a database administrator's point of view | ||
A database administrator, anticipating that a denylist or staging denylist may be used in the future, uses a flag to configure the location where the denylist would be located when it is needed. When they notice a cluster is unresponsive, and isolate the root cause to a particular query (or multiple), they optionally may want to confirm the regex query pattern matches the queries they want to deny. | ||
|
||
To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're assuming an unresponsive cluster, the Admin UI and SQL console would be inaccessible? So you could only confirm staged queries with logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, in that situation they'd have to just confirm via logs. I'll update to clarify.
## Detailed design | ||
|
||
### Design overview | ||
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How frequently do we check the denylist for updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's mentioned later -- there's a file watcher that notifies when a file is modified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's mentioned later -- we get notified whenever the file is modified
# Queries to be denied; in the future we may support denying other fields. | ||
sql: | ||
# Denies the experimental temp tables feature to be turned on. | ||
- “SET experimental_enable_temp_tables=on” |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this include all cluster settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what you mean here. Do you mean that all cluster settings could be added to this denylist? I think that's right. They can always be removed.
### File-managed denylist implementation | ||
We expect users to create their own YAML file with a `sql` field that lists an array of strings representing regex to match any SQL queries to be denied. This allows granular control of specific command types as well as database objects — a user should be able to denylist any command of a certain type, any command pertaining to a particular database object, or any additional parameters that may be affecting the cluster. | ||
|
||
In the future, we could extend the functionality with additional fields such as, for example, `IP` to support denying queries made from specific IP addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list doesn't include arbitary queries and is all features.
@joshimhoff or @JuanLeon1 Is the ability to dynamically turn off specific features what CC SRE really needs? We wouldn't need a denylist to support arbitrary query patterns if we just need to prevent backups, CDC, temp tables etc?
### Metrics logging | ||
When queries are denied, a level 2 event will be logged: `denylist match found: query %s denied, pattern matched %s`. | ||
|
||
Several graphs will be added to the Admin UI, including: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the phase where we think a file-based approach is required to meet the use case, this wouldn't be needed
|
||
Example denylist: **denylist.yaml** | ||
``` | ||
# Queries to be denied; in the future we may support denying other fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if a user enters an invalid entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I'd suggest that an invalid entry (specifically an invalid key, one that's not sql
in this case) be surfaced as a warning/error to console and the logs so that the administrator knows that the denylist was not processed correctly. @otan thoughts here?
|
||
It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings will not be able to be set. | ||
|
||
Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kind of a nit, but perhaps a word other than retries would be helpful. that makes me think of a query being retried because it failed. but i think we are also just worried about applications running more instances of an entire class of "bad queries" that it executes in the course of handling regular requests
|
||
To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs. | ||
|
||
The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to clarify: the deny list won't affect any currently running queries? what about queries that are retried internally?
|
||
The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI. | ||
|
||
The administrator may modify the file content as needed; and if the entire denylist is no longer available, they may remove it entirely. The changes will take effect immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it mean for the entire denylist to longer be available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was a typo — I probably meant "if the entire denylist is no longer necessary" Thanks for the catch!
## Detailed design | ||
|
||
### Design overview | ||
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's mentioned later -- there's a file watcher that notifies when a file is modified
|
||
It is important to support a file-managed denylist so that database administrators are able to quickly deny queries that have been confirmed to bring down clusters to the point of unresponsiveness. In this situation, SQL commands cannot be run from the console and cluster settings will not be able to be set. | ||
|
||
Without this functionality, there may be applications endlessly retrying the “bad” queries that will continuously bring the cluster down when executed. There is currently no built-in way to stop execution of a query like this other than by preventing the query from being run in the first place, and it is time- and resource-intensive for developers to write their own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kind of a nit, but i'd use a word other than retry here. retries make me think of a query being retried because the previous time failed. but we also are more concerned with the application generating more of a certain type of query in the process of handling more user requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. I hadn't thought about how the word retry is overloaded in the context of application code vs. our database internals.
|
||
The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI. | ||
|
||
The administrator may modify the file content as needed; and if the entire denylist is no longer available, they may remove it entirely. The changes will take effect immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it mean for the entire denylist to longer be available?
## Detailed design | ||
|
||
### Design overview | ||
We will implement a file-managed denylist, stored locally on each machine as a YAML configuration file. The exact location of the denylist should be configured with a flag – this will allow users to specify the location of their file, such as via the k8s ConfigMap for CockroachCloud SREs. The file location will be watched so that as soon as the config file is created, modified, or removed, the denylist will be updated. Pattern matching between queries executed and queries on the denylist will be done via regex so users maintain granular control of the queries affected by the denylist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's mentioned later -- we get notified whenever the file is modified
|
||
To confirm, they can add a staging denylist to the location they had specified for the staging denylist and execute the query. The query will execute, and if the query matches the regex query pattern specified in the denylist, the administrator will be able to view this in the Admin UI, SQL console, and CRDB logs. | ||
|
||
The administrator will then be able to add a denylist to the denylist location specified. Any queries after this moment will be matched against the denylist, and the denial logged for the administrator if there is a match. A time-series graph of queries denied, and queries that would be denied in the case of a staging denylist, will also be available to view on the Admin UI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so any currently running queries are not affected? what about queries that are retried internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. I don't believe that queries that are retried internally, or queries that are in the middle of running, should be affected — only new incoming queries.
|
||
The denylist on its own does not seem to warrant the addition of an additional abstraction in the form of a SQL proxy layer to our database product and we consider it out of scope of this RFC. If such a layer were to be added in the future, the denylist functionality should be moved there. | ||
|
||
## Unresolved questions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a section about testing? would be nice to see what sort of automated tests we can do. and also maybe some manual test cases to verify it works at the end and is a good experience
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this. Thanks for the suggestion 💖
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @joshimhoff, @otan, @rafiss, and @vy-ton)
docs/RFCS/20201020_sql_query_denylist.md, line 10 at r1 (raw file):
# Summary We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location.
We dont grant SSH access (and therefore the ability to use YAML) on CockroachCloud. As it is currently designed SREs could use this but our CC customers could not. What happens if a customer wants to override an SRE decision but can't configure this YAML file? This is straying beyond database-as-a-service and impinging on application management.
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):
We propose a mechanism built into CockroachDB to prevent operator-defined SQL queries from being executed. This will immediately address the situations where clusters enter unresponsive states due to a query, or a query pattern, that has been identified. To bring the clusters back up as soon as possible and prevent them from going down again due to the same query, we need a query denylist on the database. This RFC details a file-managed SQL query denylist implementation that will allow database administrators to trigger query denial by uploading a YAML file comprising regular expression patterns to a customizable file location. # Motivation
I'd like to see a discussion of what exists in Postgres and other database peers. My initial googling hasn't found a good comparison and it concerns me to deviate as heavily in this case. If they don't allow for this behavior perhaps we need to further reconsider.
docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):
As blocking a query pattern is invasive and potentially dangerous, it is also important to add a staging denylist for administrators to test whether the query pattern accurately matches the queries they meant to deny or not. Having a query denylist will allow database administrators to mitigate these situations quickly and immediately without making any code changes. Additionally, this will give more flexibility and time for CockroachDB’s database and release engineers to work on database fixes relevant to the query.
This feels very anti-developer in our thinking. I think we at least need some conversation in here about how operators and developers would interact. How would developers understand their query is blocked (I see this covered in part below but it feels incomplete)? Why can't developers continuously deploy a patch to the application that removes the problematic query? Or rollback to the previous version? It strikes me as dangerous to assume that operators (and particularly our SREs) know better than developers what the application needs can be. Why wouldn't we scale the cluster first and let developers/customers decide what the right action to take is (e.g., supporting the increase scale, rollingback or patching the problem).
docs/RFCS/20201020_sql_query_denylist.md, line 36 at r1 (raw file):
Previously, vy-ton (Vy Ton) wrote…
If we're assuming an unresponsive cluster, the Admin UI and SQL console would be inaccessible? So you could only confirm staged queries with logs
CC doesn't support log access for non-CRL labs employees so this also seems problematic. In addition, we would likely want to persist information in SQL/web console about what was denied and when, especially as we look toward persisting information in the future.
docs/RFCS/20201020_sql_query_denylist.md, line 124 at r1 (raw file):
Previously, vy-ton (Vy Ton) wrote…
In the phase where we think a file-based approach is required to meet the use case, this wouldn't be needed
you'd still likely want to know what was blocked historically when it came back
docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):
### Pattern matching with regular expressions There are a couple of drawbacks to consider due to the use of regular expressions to represent queries to be denied in our YAML file. The first is that it is slightly less user-friendly than exact fingerprint matching, because certain special characters in our queries will need to be escaped: eg. `SELECT \(\(a\)\) FROM b`. We accept this drawback because it gives the user writing the query much more fine-grained control than fingerprint matching does — with fingerprint matching, all queries matching a fingerprint will be blocked and the user would not be able to block, for example, queries for a specific database object.
YAML is a huge drawback. We ran into this via zone configs and people really disliked its usage. In part you could argue because they were using SQL and we forced a switch. I think this presents a similar problem.
docs/RFCS/20201020_sql_query_denylist.md, line 169 at r1 (raw file):
### Using file-managed denylist vs. cluster settings/system table A file-managed system for the denylist was chosen over cluster settings and system table for this first iteration of the denylist. This is because the most urgent use case for our users is a situation in which clusters are entirely unresponsive, and therefore updating a denylist via cluster settings or system table would not be feasible. The file-managed denylist will mitigate the most extreme circumstance and therefore is the first denylist management system we will design and implement.
I still think it would be more natural to rollback an application change that caused this problem and/or scale the db resources.
docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file):
In the future, it is likely that we will add implementations of the denylist via cluster settings for broader use cases when the cluster is not unresponsive. At this point it would be possible to surface the functionality to the Web UI for an easier user experience. ### Denylist implementation in the database vs. SQL proxy
It strikes me as potentially not cloud-first to implement this now when SQL proxy is on the roadmap purely for on-prem users. Any work we do in SQL should not be motivated purely for on-prem users.
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file): Previously, awoods187 (Andy Woods) wrote…
You're right in that most databases do not allow this kind of configurability - this is we brought up that this should really be done on a proxy layer instead. Though it evades me why we think this should go on the DB for now besides making it easier for CC to have something in the meantime. @joshimhoff is probably a good person to answer here. |
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):
|
docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file): Previously, awoods187 (Andy Woods) wrote…
Is there something that you would suggest as a replacement? Configuration of these items (as currently written) has to be file based due to the nature of the blocking. If we did this as a cluster setting / sql table, i can see that being different. |
docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file):
I can imagine a situation where the DB operators have no control over applications being pushed. It may be quicker to deny something and wait for the application rollback to be more gradual.
I believe SREs have had the problem where no matter the size of the cluster, a command may always break. |
docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file): Previously, otan (Oliver Tan) wrote…
I agree that the notification to developers that their query is being rejected needs to be solid and thought out, perhaps elaborated on a bit more. I don't understand the rest. Let's say every developer carries a pager, and is able to diagnose and fix a problem and push a solution on their application as fast as an oncall SRE can add the query to the blacklist. The time to repair will see, in the first case, the whole database and all the other apps being down, and only the one app being down in the latter case. For a production system where downtime is money, the blacklist seems like profit. Even assuming all devs are always on call etc. I don't think this is more than slightly "anti-developer". The developer can wake up at 3 am they want, and fix the problem, and remove the blacklist as they push the solution. Further, a query of death often decimates any cluster of any size, because they are usually retried until the service stops responding. I got many fireplace stories to tell that all sound like that. |
docs/RFCS/20201020_sql_query_denylist.md, line 60 at r1 (raw file): Previously, vy-ton (Vy Ton) wrote…
Well turning off specific features would also be useful, but not what we are trying to get here. It is queries that kill the system, which people would not predict would kill the system, that we need to block when system keeps dying on startup. Some of these queries may come from features like backups, CDC, etc, but many of them come from normal apps that got unlucky or are too aggressive or both. |
docs/RFCS/20201020_sql_query_denylist.md, line 169 at r1 (raw file): Previously, awoods187 (Andy Woods) wrote…
It is not always an app change, can be that a table size went over a threshold, that the DB version changed along with e.g, the optimizer. Also the operator may not have the ability to roll back the app change that may even be from a client in a different company in North Korea. Regarding scale, apps often retry, and a bad query can consume infinite resources unless throttled. More natural, perhaps, to someone, but better able to keep a service up, no. |
docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file): Previously, angelapwen (Angela P Wen) wrote…
I don't feel strongly but I think you want some people to have access to staging blocklists that you don't want to access production ones. I can see testing a list and then needing another approval from someone else to actually make it do its thing. |
docs/RFCS/20201020_sql_query_denylist.md, line 19 at r1 (raw file): Previously, JuanLeon1 wrote…
My first paragraph was wrong, in both cases the whole DB would be down. So if all devs carry pages and can diagnose/push fixes as fast as an oncall SRE can apply a blocklist, there is no benefit to the blocklist other than a bunch of unhappy, unbelievably good, devs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for the great reviews 😸
Raphael, I would tend to agree with your recommendation that we should focus on the 1st use case you mentioned above, as our goal here was to create a "non-configurable" (by the application developer), global denylist. I hadn't realized that the two (or three, per Josh!) use cases were not so compatible until I looked through these reviews, but given that they are I agree that the first one is the one that we prefer to solve first.
It sounds like creating a global denylist like this will work as needed for the multi-tenant CC (serverless) case and, with few modifications surrounding communications, can be configured for the single-tenant CC (dedicated) case.
Is it possible to present this feature as an option for on-prem customers with their full knowledge that it does not come with end-user configurability and the option to "opt out"? If this is not acceptable, then we can exclude on-prem from our intended customers of this first version of the denylist implementation.
PS — the point about benchmarking is a good one @knz. I'll discuss further details with Oliver and update progress or implementation details via this PR/RFC.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @awoods187, @joshimhoff, @knz, @otan, @rafiss, and @vy-ton)
docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
We certainly need some way to test out additions to the denylist. This tool is too dangerous not to have that capability.
I think I actually prefer a single file. The benefit is that the query you are testing in the staging denylist is def the same as the one you are about to actually block in the prod one. Since same file. If two files, maybe you copy & paste wrong, and then your test via the staging denylist is a bad test.
Wonder what Juan thinks about ^^.
That's fair. I worry that there could be danger in using the same file as well (accidentally removing the 'staging' key before it is ready or something like that). Open to either.
docs/RFCS/20201020_sql_query_denylist.md, line 58 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Should we allow adding text to include in the error message? This way, SRE can communicate with the user about why the query is being blocked?
That seems like a good idea to me. There can be default error message text (what is displayed now) and an optional field that will allow a custom error message.
docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):
Previously, knz (kena) wrote…
To clarify my last comment on this point: I am not suggesting to turn this thing into a cluster setting. We can still use files. But the reloading of the file should not be automatic and instead there should be a mechanism to trigger it.
For example, we trigger rotation of TLS certificates using a unix signal, and we trigger other types of file reloads using HTTP/RPC APIs.
Could you clarify which comment you meant about notification? I am not understanding why fsnotify
does not work well when files in a container mount are updated.
I understand the point about invalid intermediate states — most of these result in a different file extension that could be ignored by the watcher, right?
I didn't know that we already use a pattern of triggering reloads using HTTP/RPC APIs. Do you know off the top of your head where I can dig into these a bit more?
docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Interesting!
That is interesting — at the current stage we are only checking against the canonical SQL inside CockroachDB after transformations. This is not always intuitive to the operator configuring the YAML file — though if CC SREs are the primary customer, they would perhaps be able to validate with database SQL engineers sooner than on-prem operators.
The latter case mentioned could be checked for by using the ^
and $
anchors, though this may require that the operator write pretty fine-grained, complex regular expressions for the file.
docs/RFCS/20201020_sql_query_denylist.md, line 124 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Why would these graphs not be needed with a file-based approach? These graphs seem very useful to me.
I think @vy-ton meant that they wouldn't be accessible in case the cluster is fully down, but yes when they are back up they would be useful to check later on.
docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
This is a thorny problem. No one likes really any of the available languages for configuring applications. There is not really a "good" one.
YAML has comments so it is better than JSON for this use-case IMHO.
What we design ourselves very well may be worse than YAML. It is a common thing to do to invent a new (special-case) config language bc/ the existing ones are "bad"... then realize oh my ours is a mess also!
I think zone configs are logically part of the schema of an application. So should be SQL. That is not the case here.
I'd also like to avoid writing a new config language for this case if at all possible, for all the reasons you mentioned 😄
docs/RFCS/20201020_sql_query_denylist.md, line 181 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
I think this an important Q! I will be interested in following along & seeing how DB engineers answer Qs like this.
Thanks Josh! Will try to keep you updated as these benchmarks move forward. It looks like Raphael also wanted the benchmarking process to be included as part of the RFC.
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file): Previously, knz (kena) wrote…
More interesting than what Postgress does is, I think, what Managed SQL DBs do; e.g, Google Cloud SQL backed with postgress. I think "cloud-first" means managed-first, meaning there is a split of agency between the DB service operators and the app operators. Would we insist that the former have no control over the workload presented to them, and thereby relinquish control over their ability to maintain the service? |
docs/RFCS/20201020_sql_query_denylist.md, line 29 at r1 (raw file): Previously, angelapwen (Angela P Wen) wrote…
I can see Josh's point. The downside is that presumably production blocks require one of few people to approve, which means that even test and staging blocks will be starved for attention and approval. |
docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file): Previously, angelapwen (Angela P Wen) wrote…
Is the canonical SQL what shows up in the table of queries? IN logs? That would be what operators would use to block. |
docs/RFCS/20201020_sql_query_denylist.md, line 145 at r1 (raw file): Previously, angelapwen (Angela P Wen) wrote…
+1 on YACL (yet-another-configuration-language). My rule of thumb for configs: 1) JSON if a human never needs to read or write 2) YAML is config is simple enough -- as is the case here and 3) JSONET if need inheritance, modules, and still human readable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved comment to thread
docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file): Previously, joshimhoff (Josh Imhoff) wrote…
And on-prem customers will want this and would then have to get the proxy, so maybe need to ship proxy by default, which is not very different from putting the feature in CRDB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @awoods187, @joshimhoff, @knz, @otan, @rafiss, and @vy-ton)
docs/RFCS/20201020_sql_query_denylist.md, line 173 at r1 (raw file):
Previously, JuanLeon1 wrote…
And on-prem customers will want this and would then have to get the proxy, so maybe need to ship proxy by default, which is not very different from putting the feature in CRDB.
Just my 2cents -- if we are explicitly deciding to address the CC SRE use-case, and not address the self-hosted DBA use-case, then I view that as an argument against building this feature inside the cockroach server. Otherwise, we will be releasing a feature to all our self-hosted users that was not intended for them, and we will bear the maintenance and support costs as they ask us to change the feature to their liking.
(moved my comment from the top level comments to this thread)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @awoods187, @joshimhoff, @JuanLeon1, @knz, @otan, @rafiss, and @vy-ton)
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):
Previously, JuanLeon1 wrote…
More interesting than what Postgress does is, I think, what Managed SQL DBs do; e.g, Google Cloud SQL backed with postgress.
An on-prem, customer-controlled and -endured database would not need this. A DB that one party has an SLA to keep up without controlling workload from remote clients needs something like it.
E.g, spanner was built as an internal SasS and I am pretty sure Spanner SREs have this.
We can build it in the proxy, sure, but any on-prem customer of any size will want it. So we should be packaging the proxy with CRDB, which introduces some complexity.I think "cloud-first" means managed-first, meaning there is a split of agency between the DB service operators and the app operators. Would we insist that the former have no control over the workload presented to them, and thereby relinquish control over their ability to maintain the service?
Juan my experience of other teams building hosted databases is that they implement two separate features in their infrastructure:
- there is a global filter that applies to every hosted service equally, which can be deployed "quickly" by SREs and is not controllable by customers
- there is a per-application / service filter entirely controllable by customers (with limited control for SREs)
I do not think that we wish Angela to work on both at this time. We should really focus.
docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):
fsnotify
is relying on the event notifcation from the linux kernel working properly, which it does not for certain ways docker runs volume mounts.
For example, a fsnotify-based solution would certainly not work when running docker on top of macOS.
most of these result in a different file extension that could be ignored by the watcher, right?
No that is not my experience. Text editors overwrite files in-place, especially when the person editing tends to press Ctrl+S regularly as they write, which is customary.
Do you know off the top of your head where I can dig into these a bit more?
cli/start.go
for the signal handling
server/admin.go
for the RPCs
that said I do not believe you need to do a deep analysis at this point of the RFC. You can simply explain "there's going to be a trigger mechanism".
docs/RFCS/20201020_sql_query_denylist.md, line 106 at r1 (raw file):
the canonical SQL inside CockroachDB after transformations. This is not always intuitive to the operator configuring the YAML file — though if CC SREs are the primary customer, they would perhaps be able to validate with database SQL engineers sooner than on-prem operators.
I think this paragraph is key to determine the next course of action. The story you're sketching in this explanation is the following:
- malicious user sends "complicated" SQL to crdb
- crdb parses and normalizes the SQL to something simpler
- however during execution the node crashes / becomes starved for resources
- the SRE gets notified of the situation and needs to "do something about it"
However at the point step 4 happens, crdb is already down and inoperative. Therefore, at that point in time the SRE cannot any more ask the crdb node to show what the "canonical" SQL looks like. It is too late. The only thing the SRE may see is the "raw" SQL that is incoming on client connections presumably as they are trying to disrupt more nodes.
What I am pointing out is that a SRE that needs to "take action" during an attack must be able to do so with the least dependencies on the system under attack. It is unreasonable to expect them to 1) log in to a crdb node 2) do query parsing and planning 3) show the resulting SQL just so they can figure out which regexp to apply.
This clearly and directly (to my eyes) redirects the design towards a pattern match on the raw SQL before it "enters" crdb. Either in the SQL proxy or at the level of "pgwire".
This also directly reminds me of another thing we need to deal with: if there are SQL strings that cause problems in pgwire or the SQL executor before they get planned, then it's possible for them to "crash" or disrupt a node before the code inside the "execution" function is reached, where you were planning to introduce the pattern match.
I am also reminded that if a SQL string is extremely long, or if it contains a lot of placeholders, then pgwire is going to consume a lot of resources just reading the SQL. So simply waiting for the SQL to be entirely received by a crdb node makes it already too late for the check.
So we can't do this deep inside the SQL layer, after query normalization. And even in pgwire may be too late.
So overall I think this makes it clear that pattern matching needs to apply on the "raw" SQL, and as early as possible, and if possible before it "enters" the crdb node. So that tends to point to the SQL proxy, and this way my reasoning gets me to the same place Rafi also reached in his last comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all for your input. Will regroup with @vy-ton and @otan today to discuss the great points brought up and consider the SQL proxy mechanism in more depth
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @awoods187, @joshimhoff, @JuanLeon1, @knz, @otan, @rafiss, and @vy-ton)
docs/RFCS/20201020_sql_query_denylist.md, line 11 at r1 (raw file):
Previously, knz (kena) wrote…
Juan my experience of other teams building hosted databases is that they implement two separate features in their infrastructure:
- there is a global filter that applies to every hosted service equally, which can be deployed "quickly" by SREs and is not controllable by customers
- there is a per-application / service filter entirely controllable by customers (with limited control for SREs)
I do not think that we wish Angela to work on both at this time. We should really focus.
I agree that we should be focusing on the former: the global filter that is deployed by SREs and does not have application developer configurability.
docs/RFCS/20201020_sql_query_denylist.md, line 76 at r1 (raw file):
Previously, knz (kena) wrote…
fsnotify
is relying on the event notifcation from the linux kernel working properly, which it does not for certain ways docker runs volume mounts.
For example, a fsnotify-based solution would certainly not work when running docker on top of macOS.most of these result in a different file extension that could be ignored by the watcher, right?
No that is not my experience. Text editors overwrite files in-place, especially when the person editing tends to press Ctrl+S regularly as they write, which is customary.
Do you know off the top of your head where I can dig into these a bit more?
cli/start.go
for the signal handling
server/admin.go
for the RPCsthat said I do not believe you need to do a deep analysis at this point of the RFC. You can simply explain "there's going to be a trigger mechanism".
Thank you, understood! 🙏
Hi everyone 👋 thank you for all your input on the RFC thus far. We've decided to work on a small prototype solely for the SRE use case to begin with, and once the SREs have played around with it a bit we can continue discussions on this RFC with more information. To that end we've moved conversation to #tmp-sql-denylist on Slack, where you're welcome to join us. We've had some interesting suggestions by @jordanlewis and @andy-kimball in the channel to add to the mix that may inform some larger changes to the RFC and prototype. |
Following up for interested parties. Based off conversation in the Slack channel, we decided to move forward with a simpler feature flag implementation for now to address CC SRE's use cases more quickly. Thank you for your feedback again! We will leave this RFC draft if and when the denylist is revisited. |
Please translate the document you linked to a new rfc document in the repository. We need a public record of this conversation |
Sure, I am happy to do this. I was under the impression that the feature flag implementation was not advanced enough to require its own RFC, but understand the need for public record somewhere outside of a Google Doc hosted in an intern's Google Drive. I will write a separate RFC document and link to this one. |
55325: ui: Transactions link between Sessions and Statements r=dhartunian a=elkmaster Resolves: #55131, #56244 Release note (admin ui change): Link to the Transactions page is now shown between the Sessions and Statements links in the left hand navigation. This more clearly reflects the hierarchy between the 3 concepts. 56346: testcluster: minor logging improvements r=andreimatei a=andreimatei Log when TestCluster quiescing starts, and add a node log tag to each node's quiescing ctx so messages from different nodes can be disambiguated. Release note: None 56437: cli, ui: dismiss release notes signup banner per environment variable r=knz,dhartunian a=nkodali Previously, the signup banner could only be dismissed manually. For internal testing purposes, this banner is unnecessary. This change provides a way to dismiss the signup banner upon start of a cluster via the cli by setting the environment variable COCKROACH_UI_RELEASE_NOTES_SIGNUP_DISMISSED=true. Resolves #46998 Release note: none 56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands. See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'` — Commit message below — Adds a cluster setting to toggle a feature flag for the BACKUP and RESTORE commands off and on; as well as a broad category for Bulk IO commands. Currently disabling the cluster setting for Bulk IO will only disable BACKUP and RESTORE jobs, but other types may be included in this category in the future.. The feature is being introduced to address a Cockroach Cloud SRE use case: needing to disable certain categories of features in case of cluster failure. Release note (enterprise change): Adds cluster settings to enable/ disable the BACKUP and RESTORE commands. If a user attempts to use these features while they are disabled, an error indicating that the database administrator has disabled the feature is surfaced. Example usage for the database administrator: SET CLUSTER SETTING feature.bulkio.backup.enabled = FALSE; SET CLUSTER SETTING feature.bulkio.backup.enabled = TRUE; SET CLUSTER SETTING feature.bulkio.restore.enabled = FALSE; SET CLUSTER SETTING feature.bulkio.restore.enabled = TRUE; SET CLUSTER SETTING feature.bulkio.enabled = FALSE; SET CLUSTER SETTING feature.bulkio.enabled = TRUE; 56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco Previously, when using OSS builds (created with `make buildoss`), when you loading the DB Console in your browser, you'd get "Page Not Found". The route for the overview page was missing the leading '/'. This bug appears to have been introduced in 722c932. Release note (admin ui change): This fixes a bug where users of the OSS builds of CockroachDB would see "Page Not Found" when loading the Console. 56600: roachpb: remove SetInner in favor of MustSetInner r=nvanbenschoten a=tbg As of a recent commit, `ErrorDetail.SetInner` became unused, and we can switch to a `MustSetInner` pattern for `ErrorDetail`. Since the codegen involved is shared with {Request,Response}Union, those lose the `SetInner` setter as well; we were always asserting on the returned bool there anyway so this isn't changing anything. Release note: None Co-authored-by: Vlad Los <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Namrata Kodali <[email protected]> Co-authored-by: angelapwen <[email protected]> Co-authored-by: Joshua M. Clulow <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>
56533: backupccl: add feature flag support for BACKUP, RESTORE r=otan a=angelapwen Follow-up to the RFC at #55778. This addresses the SRE use case mentioned in #51643 — instead of moving forward with a global denylist as the RFC indicated, we are prototyping feature flags via cluster settings to turn on/off requested features. The first part of the prototype will be done on `BACKUP` and `RESTORE` commands. See [this doc](https://docs.google.com/document/d/1nZSdcK7YprL0P4TAuseY-mvlYnd82IaJ_ptAQDoWB6o/edit?) for further details. Note that the logic test under `ccl/backupccl/testdata/backup-restore/feature-flags` can be tested with the command `make test PKG=./pkg/ccl/backupccl TESTS='TestBackupRestoreDataDriven'` — Commit message below — Adds a cluster setting to toggle a feature flag for the BACKUP and RESTORE commands off and on. The feature is being introduced to address a Cockroach Cloud SRE use case: needing to disable certain categories of features in case of cluster failure. Release note (enterprise change): Adds cluster settings to enable/ disable the BACKUP and RESTORE commands. If a user attempts to use these features while they are disabled, an error indicating that the database administrator has disabled the feature is surfaced. Example usage for the database administrator: SET CLUSTER SETTING feature.backup.enabled = FALSE; SET CLUSTER SETTING feature.backup.enabled = TRUE; SET CLUSTER SETTING feature.restore.enabled = FALSE; SET CLUSTER SETTING feature.restore.enabled = TRUE; 56591: ui: fix Overview screen in OSS builds r=dhartunian a=davepacheco Previously, when using OSS builds (created with `make buildoss`), when you loading the DB Console in your browser, you'd get "Page Not Found". The route for the overview page was missing the leading '/'. This bug appears to have been introduced in 722c932. Release note (admin ui change): This fixes a bug where users of the OSS builds of CockroachDB would see "Page Not Found" when loading the Console. 56631: kv: remove assertion around QuotaPool and Raft window size config r=nvanbenschoten a=nvanbenschoten This check was well intentioned when added, but seems misguided in retrospect. There are valid reasons why someone might want to configure limits to various dimensions of the Raft window size in ways that, when combined, exceeds the total window size limit (i.e. the quota pool size). Co-authored-by: angelapwen <[email protected]> Co-authored-by: Joshua M. Clulow <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
Hi everyone 👋 as my internship winds down and I look to presenting on Thursday, I am providing an update here via comment on the status of the project and our design decisions thus far. As the current implementation was much simpler and did not require many comments, I am writing down the relevant notes here as opposed to another separate RFC. Current Implementation — Feature FlagsPrimary DifferencesInstead of a denylist, we will move forward with an MVP of a feature flag-based way to block SQL queries based off of recent conversations in #tmp-sql-denylist. It is estimated that this mechanism will block >90% of the queries that SREs have needed to be blocked in the past and a finer-grained regex denylist is not indicated at this time. Additionally it seems that there will be a longer path to engineering consensus around designs for the regex denylist, and this implementation will allow us to more quickly address the most pressing issue at hand. The feature flag implementation means that we do not have need for a ‘staging’ version anymore, as there should not be ambiguity about whether a feature flag correctly matches the feature to be turned off (as compared to regular expression matching). Another difference between this proposal and the previous one is that there will be a separation of concerns between what the SREs will handle on the Cockroach Cloud side vs. the database engineers on CRDB. Originally it was proposed that we would need a file-based mechanism so that SREs would be able to implement a denylist when clusters are unresponsive; but the consensus on #tmp-sql-denylist has been that a better design (proposed by @andy-kimball) would be for SREs to deny access for the particular user triggering the unresponsiveness; implement the query denying on the SQL side; and then allow access for the user again. This would allow for greater decoupling between the database and CC SRE work, and allows us to move forward using cluster settings for implementation. Storage Mechanism — Cluster SettingsWe use cluster settings to load in the flags we want to turn on. Every time the relevant cluster setting is set or changed, all nodes can parse the new configuration and store which features to block in a cache. Once the new configuration is cached/stored, the feature flag will be checked during the planning phase of command execution. Our MVP includes boolean flags for each feature or feature category. So far we have: A sample PR for basic implementation on Metrics and TelemetryA telemetry counter Future ExtensionsAdding hierarchy to featuresThis option allows even greater ability to set and unset entire categories of features, such as schema changes; while maintaining the ability to set and unset more granular features as well. Syntactic SugarThis extension would wrap the same functionality mentioned above into a cleaner, more intuitive CLI command so that operators would not need to learn syntax around setting cluster settings. Future ConsiderationsWe consider many features out of scope for this MVP, but do want to consider how extending the MVP to allow for those features affect our design. These include:
I hope this is useful for any future readers who may be interested in expanding upon the feature flags or moving forward with the denylist! ✨ |
Thank you for the writeup (& the feature flags)!! |
do you mind replacing the RFC with the write up, let's merge this (so it's off my stuff to review list :P ) |
I asked angela to make the rfc. If you cannot review it then I'll do it. |
Ah the bit in the comment just needs to be translated into the markdown and we should be good. |
(no rush tho!) |
Oh, sorry I think I had misunderstood. I had thought we were going to leave this RFC PR 'open' until/if a denylist or file-mounted path were to be revisited. To clarify, I should replace the original markdown file on denylist with the path we ended up going (summarized via the last comment) and merge it? @otan @knz |
There are two possible situations:
|
I see! I will do the second option as I believe they are quite fundamentally different designs and would benefit from having two separate docs. |
angelapwen seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Co-Authored-By: Oliver Tan [email protected]
View text at: https://github.com/angelapwen/cockroach/blob/denylist-rfc/docs/RFCS/20201020_sql_query_denylist.md
Thank you in advance for the reviews! ✨