-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/pgwire: implement pgwire query cancellation #67501
Conversation
pkg/server/serverpb/status.proto
Outdated
@@ -1185,6 +1209,14 @@ service Status { | |||
}; | |||
} | |||
|
|||
// PGWireCancelQuery cancels a SQL query given its pgwire "secret ID". | |||
rpc PGWireCancelQuery(PGWireCancelQueryRequest) returns (PGWireCancelQueryResponse) { | |||
option (google.api.http) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i got some pushback on making this available as a http endpoint and to remove it in a different PR, wonder if you'll be told the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah interesting, let us see what people think.. I put it here since the other CancelQuery endpoint is in here, but open to hearing alternatives.
// Response returned by target query's gateway node for a pgwire cancel request. | ||
message PGWireCancelQueryResponse { | ||
// Whether the cancellation request succeeded and the query was canceled. | ||
bool canceled = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: i think it's more common to have something called success
, but up to you.
would it be more sane to have this have no response field but a grpc error if this fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, but I was matching the response format of the non-pgwire CancelQueryResponse. I lean in favor of matching that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm, up to you.
pkg/sql/pgwire/conn_test.go
Outdated
// Cancellation has no effect on ongoing query. | ||
if _, err := db.QueryContext(cancelCtx, "select pg_sleep(0)"); err != nil { | ||
// Cancellation should stop the query. | ||
if _, err := db.QueryContext(cancelCtx, "select pg_sleep(1)"); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would sleep longer (30s?) in case of, e.g., race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
pkg/sql/pgwire/server.go
Outdated
resp, err = s.execCfg.SQLStatusServer.PGWireCancelQuery(ctx, req) | ||
if err == nil && len(resp.Error) > 0 { | ||
err = fmt.Errorf(resp.Error) | ||
} | ||
telemetry.Inc(sqltelemetry.CancelRequestCounter) | ||
_ = conn.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why do we close the current connection? wouldn't this render the session that issued the cancel unusable in subsequent requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well the session that issued the cancel is not really a SQL session -- it's an entirely new connection that's created just to send the cancel. so there isn't anything left for the server to do after it handles the pgwire-cancel.
also, seems like lib/pq actually relies on the server closing the connection. see https://github.com/lib/pq/blob/9e747ca50601fcb6c958dd89f4cb8aea3e067767/conn_go18.go#L169-L172
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, probably want a review from the server team too
pkg/sql/pgwire/server.go
Outdated
resp, err = s.execCfg.SQLStatusServer.PGWireCancelQuery(ctx, req) | ||
if err == nil && len(resp.Error) > 0 { | ||
err = fmt.Errorf(resp.Error) | ||
} | ||
telemetry.Inc(sqltelemetry.CancelRequestCounter) | ||
_ = conn.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right after looking at this we need to have two discussions
Discussion about the encoding and the secret entropy
I am not fundamentally opposed to the approach but the design discussion should be reported on in the commit message (I am mildly surprised there was no RFC). For example, it could include answers to the following questions:
- If a node has 10000 sessions open, there's only 19 bits of entropy to protect us against duplicate IDs or an attacker guessing a valid session ID by chance. are we comfortable with this?
- the chance that that session's IDs gets allocated twice increases over time. for long-lived sessions (which is not unreasonable for SQL), then the chance of duplicates increases with the age of the session. What does the math say?
- what does pg do? Are we doing something similar? (If not, why? what was the trade-off?)
- what's the probability chance of a duplicate ID as a function of the concurrent number of session and their age?
- what happens if there are, in fact, duplicate IDs, and a cancel is issued? How would an operator notice that multiple sessions were cancelled?
- what is the % chance for an attacker to guess a session ID as a function of the number of open sessions?
Also I observe the solution described here uses a full 32-bit value to describe the node ID. In practice, it's somewhat unlikely we ever need all 32 bits. I wonder if we could get some mileage by letting the two values overlap somehow, and then doing a fan-out across multiple nodes to "find the session" if there is an ambiguity?
Alternatively, we can use a variable encoding. For example, we could use a scheme looking as follows:
- designate one bit in the "node ID part" (e.g. bit 31) to indicate whether the field contains a "large node ID" or a "small node ID". Based on our deployment experience, a node ID is "small" if it is smaller than 1024 (i.e. 10 bits).
- based on that idea, use the field as follows: if bit 31 is unset (small node ID), then use bits 30-11 as additional secret bits. Otherwise, consider the node ID is large and then keep the secret at 32 bit.
- when processing a cancellation request, look at bit 31 to decide how to compute the node ID from the first 32 bits.
Why this works and in fact makes for better security:
- in the common case of small node IDs, we get more secret bits. This is good! It lowers the chance of collisions and random guesses by increasing the entropy to 51 bits.
- in the less common case of large node IDs, it is also/still good: in that case, the node IDs are large and so an attacker must also guess the node ID accurately (we get extra entropy from the node IDs themselves).
(We could even devise an even better scheme using variable-length encoding for the node ID and use all remaining bits as secret, so there's no sharp cut off when node IDs become larger than 1024).
Anyway, one thing to take from the convo above is that in the implementation the data type used to represent a node+secret combination should be abstracted into a separate Go type, with methods to retrieve the node ID and secret part, and (conceptually, in the API) treat the secret part as an int64 regardless of implementation. This will ensure the implementation is not painted into a corner.
Also, the file where the data type is implemented should be the place where the scheme is documented inside the source code (as a comment). Then the other places that implement the cancellation should forward the reader to that central place where the thing is documented. Given the sensitive nature of this code, we don't want multiple locations in the code making assumptions about the implementation and so all the comments should forward to one source of truth.
Finally, the specific choices and the ideas for the scheme should be included in a release note with label "security update".
Discussion about rate limiting
The overall idea to use a semaphore with a delay in case of failed cancellation is good. I recommend explaining this in a release note marked with "security update" too.
However, the following comment reveals something problematic:
// The request is forwarded to the appropriate node. We implement the rate
// limit in the node to which the request is forwarded. This way, if an
// attacker spams all the nodes in the cluster with requests that all go to
// the same node, the per-node rate limit will prevent them from having
// too many guesses.
It's important to have a semaphore at the destination, however the mechanism you're introducing here is creating a DDoS attack surface that did not exist before: it becomes possible for an attacker to flood the network with this. It's even a bit worse than that, as it will really flood the inter-node gRPC connections that are also used for KV traffic and cause general cluster instability.
I think we need two semaphores, one at the source, per node, to protect against what I wrote; and one at the destination, for the reason you wrote above.
Reviewed 4 of 4 files at r1, 7 of 7 files at r2, 8 of 9 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rafiss)
pkg/server/serverpb/status.proto, line 649 at r2 (raw file):
// node_id is a string so that "local" can be used to specify that no // forwarding is necessary. string node_id = 1;
I don't think requesters will ever use "local" so you can keep an int here (specifically, an in32 and use gogo.casttype to roachpb.NodeID).
pkg/server/serverpb/status.proto, line 652 at r2 (raw file):
// "Secret" ID of query to be canceled. This comes from the BackendKeyData // generated during the pgwrite protocol flow. uint32 secret_id = 2 [(gogoproto.customname) = "SecretID" ];
see discussion at top: probably worth implementing this as uint64 and keep the specifics encapsulated where the node/secret pair type is implemented in SQL.
pkg/server/serverpb/status.proto, line 1214 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
Ah interesting, let us see what people think.. I put it here since the other CancelQuery endpoint is in here, but open to hearing alternatives.
Yes indeed this functionality does not warrant a HTTP endpoint.
pkg/sql/pgwire/conn.go, line 722 at r1 (raw file):
// connection. The cancel code and secret are both int32's. The cancel code // is supposed to be the connection's PID, but since we don't have one of // those, we use the node ID and a random 32-bit ID.
Maybe change this comment into a link to another file elsewhere where the full scheme is documented.
Also, I forgot the most important discussion of them all How is this supposed to work with CockroachCloud?Especially, serverless clusters, which hopefully will constitute the majority of our CC users. Reminder, serverless uses a "SQL proxy" which takes connections from customers on a single hostname ( Obviously, the current username/password-based routing simply cannot work with the pg cancellation protocol. What to do instead? Here are the parameters of the problem:
There's a perhaps naive approach to side-step these questions, by asking: "Would it be OK for the cancellation protocol to only work for on-prem and dedicated clusters?" To which there are two answers:
What to do about it? I think we could perhaps make-do by not forwarding the pg cancel request in the sql proxy, and handling cancellation inside the SQL proxy instead. In that view, the cancellation ID would identify a session inside a SQL proxy, not inside a SQL tenant. (We'd use SQL proxy ID, session ID inside proxy, as cancellation key). Then when a SQL proxy receives a valid cancel request, it could either send a proper CANCEL SESSION statement to the back-end SQL tenant, or perhaps even more simply close the back-end TCP connection. |
In the hypothetical case where cancellation for CC can indeed be handled inside the SQL proxy without participation from CockroachDB, then I wonder if we should not implement that first -- it would have higher impact on our business. |
Another thing, in the case this is meant to work db-side for CC server less clusters. How does this work with session migration? Reminder, we're aiming to hold idle SQL sessions open in the proxy while shutting down the SQL pod. When the session becomes active again, a new SQL pod is started. However the new SQL pod likely gets a new instance ID! I think in serverless clusters we can't really assume that the SQL instance ID is stable across the lifetime of the SQL session. Andy K might be able to inform further. How would cancellation work in that case? |
And then one more question, if we evolve our product offering to recommend a CRDB SDK to develop clients, could we perhaps implement cancellation in the SDK with a CANCEL SESSION statement? |
Could we use the IP address of the caller to increase security? For example, we could have the requirement that only the IP address that originally established the authenticated session is allowed to cancel active queries in that session. While something like this would be essential to do for Serverless clusters, I'd think that'd be a good check even for Dedicated clusters. Also, for Serverless, could we use the IPv4 address of the SQL instance instead of the NodeID of the CRDB node as the |
I also agree with Raphael that this feature could use an RFC, given the complexity of the architectural considerations in play here. |
I will work on providiong answers to these questions. They're good to ask. Specifically for "Discussion about the encoding and secret entropy" the decisions I made were based on the conclusions of the large thread here: #34520 (comment) But I agree I should summarize and present those conclusions in these commit messages. I didn't write an RFC since in my mind, the discussion already happened, but looks like there's still enough demand for an RFC (esp given that Serverless was not considered in the previous discussion). I did consider adding a second semaphore, but kept going back and forth on that. Your comment assures me that it's a good idea to add. Good points about CC-Serverless. This approach I've implemented will not work with that or if we do session migration in other places. I'm not advocating for keeping it not working, but I will also just point out that the pgwire cancel protocol is "best effort." specifically, from the PG docs:
And also for some context, the big motivation for us to pick up this work is that it is the top unimplemented feature in our telemetry. cc @vy-ton in case there is more info about the priorities and for awareness on the rest of the discussion |
I'd like to semi-seriously ask: if pg considers this feature best effort, what is the difference being doing something that's only going to work for a minority of clusters, and correctly doing absolutely nothing in response to a cancellation request? Is it perhaps easier to ... remove the telemetry counter altogether? |
Another even simpler idea: just use the two fields in This method would work for both Dedicated and Serverless clusters, be relatively easy to implement, would be reasonably secure, and should be no worse than |
Andy, you're forgetting that these requests are not authenticated. In serverless, that means a proxy would need to broadcast to all tenants too. |
The existing cockroach/pkg/server/status.go Lines 2213 to 2222 in fdb86a2
For tenants, cockroach/pkg/server/tenant_status.go Line 91 in fdb86a2
This latter point makes me feel that this PR is not the place for us to make |
Could sqlproxy also use the source IP of the cancel request to decide which tenant to broadcast to? |
Ah, I'd thought someone said we broadcast when canceling - I was mistaken. And Raphael's right that it's way too expensive to broadcast to every tenant anyway. The problem with using the source IP to decide upon the tenant is that every SQL proxy won't necessarily have knowledge of the mapping from source IP to tenant(s), because not every source IP would be connected through every proxy. We'd have to coordinate among SQL proxies to distribute the information, which would introduce complexity and fragility. One thing that every proxy does have: a complete directory of the TenantID => TenantPodList for every tenant with active pods (though there's a short pub-sub delay to be fully fresh). We could take a hash of the pod address (IP/host + port) to fit even IPv6 addresses within the 32-bit
One downside of this design is that there are race conditions where the proxy directory is stale. But this seems a small downside, since mostly callers would cancel longer-running queries and also since |
Telemetry does indicate that this unsupported feature contributes to user confusion (todo docs issue). We also see TS repeatedly answering this question. Is it accurate to say that we do want to support
|
I'd be worried that the design for Serverless might change how we want to design for Dedicated. In that case, it'd be a mistake to merge this now, since it'd mean possibly painting ourselves into a corner later on when we generalize the design. If we've done enough thinking and have written down how this design can be extended in the future to handle Serverless, then I think that's sufficient due-diligence to merge this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this looks nice now.
I think it's pretty important to have two separate metrics:
- cancel count (I think we track this already)
- ignored cancel queries due to semaphore (we'll need this to troubleshoot cancel issues under load)
- cancel queries processed successfully
Reviewed 1 of 20 files at r5, 4 of 23 files at r11, 4 of 6 files at r12, 14 of 15 files at r13.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @otan, and @rafiss)
pkg/server/tenant_status.go, line 253 at r13 (raw file):
resp = &serverpb.CancelQueryByKeyResponse{} // Acquiring the semaphore here helps protect both the source and destination
I'm not completely clear on the semaphore behavior.
Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?
Also I'm not completely clear on what are the consequences of using TryAcquire instead of regular Acquire. Does this mean that if the cluster is overloaded, the query requests get ignored?
I think that the desired behavior is this:
- TryAcquire on the ingress from sql clients, and drop excess requests, but with a high(er) semaphore count, perhaps 256
- Acquire (and wait if necessary) on the inter-node RPC, so that requests that have been accepted externally all get a chance to be processed (albeit, possibly, with a delay)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call on metrics. will add that as another commit.
i don't think we actually have any metrics now -- we just have product telemetry on the count of requests
Dismissed @knz and @otan from 6 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, and @otan)
pkg/server/tenant_status.go, line 253 at r13 (raw file):
Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?
I tried to explain that in the comment: "The destination node is protected so that if an attacker spams all the nodes in the cluster with requests that all go to the same node, this semaphore will prevent them from having too many guesses." More concretely, suppose you have a 100-node cluster. If you limit the ingress only, then each node is limited to processing X
requests per second. But if an attacker crafts the CancelRequests to all target the same SQLInstance, then that one instance would have to process 100*X
requests per second.
I don't think the proposal to to Acquire and Wait on the inter-node RPC would address this concern.
pkg/server/serverpb/status.proto, line 649 at r2 (raw file):
Previously, knz (kena) wrote…
I don't think requesters will ever use "local" so you can keep an int here (specifically, an in32 and use gogo.casttype to roachpb.NodeID).
Done.
pkg/server/serverpb/status.proto, line 652 at r2 (raw file):
Previously, knz (kena) wrote…
see discussion at top: probably worth implementing this as uint64 and keep the specifics encapsulated where the node/secret pair type is implemented in SQL.
Done.
pkg/server/serverpb/status.proto, line 658 at r2 (raw file):
Previously, otan (Oliver Tan) wrote…
mm, up to you.
Done.
pkg/server/serverpb/status.proto, line 1214 at r2 (raw file):
Previously, knz (kena) wrote…
Yes indeed this functionality does not warrant a HTTP endpoint.
Done.
pkg/server/serverpb/status.proto, line 853 at r10 (raw file):
Previously, knz (kena) wrote…
Would it make sense to break down the fields of the int64 into sub-fields? It may make the RPC code easier to read.
To me the term "backend keydata" is a pgwire-specific concept. Do we need to propagate it internally through the cluster?
I could even argue it does not need to propagate to the
sql
package. The more narrow we can keep the uses of the magic logic in the new pgwirecancel package, the more flexibility we'll have later to change/add new ways to use cancellation.The main reason for my concern is that the cancellation key is the only part that deserves to be (and should remain) opaque everywhere else than the code that checks it.
The sql instance ID, in contrast, should be prominent. We'll want that code that organizes RPCs (and pod-to-pod conns), which needs to be maintained over time by non-SQL experts, has all the information it needs (SQL instance ID for routing) and does not get "polluted" by concerns that the maintainer doesn't need to concern themselves about.
Done.
pkg/sql/pgwire/server.go, line 763 at r3 (raw file):
Previously, otan (Oliver Tan) wrote…
maybe worth a comment
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nits, you can do metrics here or in a followup PR.
Reviewed 1 of 15 files at r13.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @otan, and @rafiss)
pkg/server/tenant_status.go, line 253 at r13 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?
I tried to explain that in the comment: "The destination node is protected so that if an attacker spams all the nodes in the cluster with requests that all go to the same node, this semaphore will prevent them from having too many guesses." More concretely, suppose you have a 100-node cluster. If you limit the ingress only, then each node is limited to processing
X
requests per second. But if an attacker crafts the CancelRequests to all target the same SQLInstance, then that one instance would have to process100*X
requests per second.I don't think the proposal to to Acquire and Wait on the inter-node RPC would address this concern.
Can you add the example to the comment, thanks.
pkg/sql/exec_util.go, line 1884 at r13 (raw file):
id ClusterWideID, queryCancelKey pgwirecancel.BackendKeyData, s registrySession, ) { r.Lock()
nit: defer unlock, in case there's an error accessing the maps
pkg/sql/exec_util.go, line 1891 at r13 (raw file):
func (r *SessionRegistry) deregister(id ClusterWideID, queryCancelKey pgwirecancel.BackendKeyData) { r.Lock()
ditto
pkg/sql/pgwire/server.go, line 771 at r13 (raw file):
var resp *serverpb.CancelQueryByKeyResponse resp, err = s.execCfg.SQLStatusServer.CancelQueryByKey(ctx, req) if err == nil && len(resp.Error) > 0 {
if len(resp.Error) > 0 {
err = errors.CombineErrors(err, errors.Newf("...", resp.Error))
}
cc @matthewtodd no need to review, but you may be interested in the last commit of this PR, since it relates to the cancel endpoints in the status servers. see the RFC or DM me if you need more context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @knz from 2 discussions.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @matthewtodd, and @otan)
pkg/server/tenant_status.go, line 253 at r13 (raw file):
Previously, knz (kena) wrote…
Can you add the example to the comment, thanks.
done
pkg/sql/exec_util.go, line 1884 at r13 (raw file):
Previously, knz (kena) wrote…
nit: defer unlock, in case there's an error accessing the maps
done
pkg/sql/exec_util.go, line 1891 at r13 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
pkg/sql/pgwire/server.go, line 771 at r13 (raw file):
Previously, knz (kena) wrote…
if len(resp.Error) > 0 { err = errors.CombineErrors(err, errors.Newf("...", resp.Error)) }
fixed
This is an ID used for pgwire query cancellation. In Postgres it contains a process ID and a secret ID. Since we don't have process IDs, we use a SQLInstanceID instead, which will also allow us to forward these requests to the correct node. To maximize the number of bits used for the secret, we first check if the SQLInstanceID can fit into 11 bits. If not, the leading bit of the BackendKeyData is set, and the remaining 31 bits are used for the SQLInstanceID. We introduce another map in the session registry that is keyed by the BackendKeyData. No release note since this doesn't yet cause a user-facing change -- we previously were sending a BackendKeyData of 0 all the time. Release note: None
e249ed1
to
306bb9d
Compare
This will be used by the pgwire query cancelation protocol. Nothing in this commit uses the endpoint and it is not implemented yet, so there is no release note. Release note: None
306bb9d
to
07fbfe0
Compare
Release note (sql change): Added support for query cancellation via the pgwire protocol. CockroachDB will now respond to a pgwire cancellation by forwarding the request to the node that is running a particular query. That node will then cancel the query that is currently running in the session identified by the cancel request. The cancel request is made through the pgwire protocol when initializing a new connection. The client must first send 32 bits containing the integer 80877102, followed immediately by the 64-bit BackendKeyData message that the server sent to the client when the session was started. Most Postgres drivers handle this protocol already, so there's nothing for the end-user to do apart from calling the "cancel" function that their driver offers. See https://www.postgresql.org/docs/13/protocol-flow.html#id-1.10.5.7.9 Release note (security update): Added support for query cancellation via the pgwire protocol. Since this protocol is unauthenticated, there are a few precautions included. (1) The protocol requires that a 64-bit key is used to uniquely identify a session. Some of these bits are used to identify the CockroachDB node that owns the session. The rest of the bits are all random. If the node ID is small enough, then only 12 bits are used for the ID, and the remaining 52 bits are random. Otherwise, 32 bits are used for both the ID and the random secret. (2) A fixed per-node rate limit is used. There can only be at most 256 failed cancellation attempts per second. Any other cancel requests that exceed this rate are ignored. This makes it harder for an attacker to guess random cancellation keys. Specifically, if we assume a 32-bit secret and 256 concurrent sessions on a node, it would take 2^16 seconds (about 18 hours) for an attacker to be certain they have cancelled a query. (3) No response is returned for a cancel request. This makes it impossible for an attacker to know if their guesses are working. There are internal warning logs for unsuccessful attempts. Large numbers of these messages could indicate malicious activity.
07fbfe0
to
81fd3f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 6 files at r15, 5 of 16 files at r16.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @knz, @matthewtodd, @miretskiy, @otan, and @rafiss)
pkg/server/status.go, line 2472 at r16 (raw file):
client, err := s.dialNode(ctx, roachpb.NodeID(req.SQLInstanceID)) if err != nil { return nil, err
return nil, serverError(ctx, err)
pkg/server/tenant_status.go, line 295 at r16 (raw file):
statusClient, err := t.dialPod(ctx, req.SQLInstanceID, instance.InstanceAddr) if err != nil { return nil, err
return nil, serverError(ctx, err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @knz, @matthewtodd, and @otan)
pkg/server/status.go, line 2472 at r16 (raw file):
Previously, knz (kena) wrote…
return nil, serverError(ctx, err)
i can't find the serverError
function. the other endpoints in this file aren't calling it either.
pkg/server/tenant_status.go, line 295 at r16 (raw file):
Previously, knz (kena) wrote…
return nil, serverError(ctx, err)
ditto - i can't find the serverError
function. the other endpoints in this file aren't calling it either.
thanks for the reviews and discussion all! bors r=knz |
🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
Build succeeded: |
closes #41335
See individual commits for more details.
This adds a BackendKeyData struct which identifies a session and SQLInstance
running that session. Clients send it to cancel a request, and it's handled by a new
CancelQueryByBackendKeyData endpoint in the status server. The cancellation
gets forwarded to the correct node, and there is a per-node rate limit to
prevent a DoS attack.
Also, refer to the RFC: #75870
This PR only adds support for non-multitenant clusters. Refer to the RFC for additional
work needed for serverless/multitenant support.