multitenant: key out of tenant keyspace bounds - infinite retry loop #98822
Comments
Excellent issue description. I am seeing the following code in the connector's `TokenBucket` RPC handling:

```go
log.Warningf(ctx, "error issuing TokenBucket RPC: %v", err)
if grpcutil.IsAuthError(err) {
	// Authentication or authorization error. Propagate.
	return nil, err
}
```

Could we solve our problem by doing the same in `NewIterator`? And while we're at it, do it in the other connector RPC loops as well.
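For what it's worth, here is a minimal sketch of what "doing the same" inside `NewIterator`'s receive loop could look like. The surrounding retry loop, the `descriptors` slice, and the response field name are assumptions for illustration, not the actual connector code:

```go
for {
	resp, err := stream.Recv()
	if err == io.EOF {
		break // clean end of stream; hand the descriptors to the iterator
	}
	if err != nil {
		if grpcutil.IsAuthError(err) {
			// Authentication/authorization error, e.g. the requested span is
			// outside the tenant keyspace. Propagate instead of retrying.
			return nil, err
		}
		// Transient RPC error: break out so the outer loop can drop the
		// client and retry with a fresh one.
		break
	}
	descriptors = append(descriptors, resp.RangeDescriptors...) // field name assumed
}
```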
Hi @knz, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
We should also consider extending the response type protobufs used by the tenant connector so that they can report errors.
Do you have insight into why the errors are split between the response payload and the gRPC layer?
Also, the response being "success" vs "error" seems to be enforced as a comment and not as a type union.
If we are already forced to handle top-level errors and can use gRPC status codes for them, why add a separate error field to the response?
The error payload inside the response reflects application-level errors. The "outer" gRPC error reflects errors happening in the communication between client and server, and can be reported before authentication/authorization, and thus must not contain application details.
It looks like application-level errors are supported in gRPC status codes: https://grpc.github.io/grpc/core/md_doc_statuscodes.html. I think the tenant trying to access keys outside of its keyspace would map to one of these codes. Is there a reason not to use a status code like this?
Yes - gRPC status codes/errors can't have a payload, so we don't get things like stack traces, redactable strings, etc. (We have a separate project to extend gRPC error reporting to use our github.com/cockroachdb/errors protobuf to carry this data, but it's not done yet.)
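For reference, this is roughly how the failure would look if it were carried purely as a gRPC status, using the standard gRPC-Go API (the error message is made up). Only the code and message are available to the caller, which is the limitation described above:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Server side: report the failure as a plain gRPC status.
	err := status.Error(codes.PermissionDenied, "requested span is outside the tenant keyspace")

	// Client side: the code tells us the error is not retryable.
	if s, ok := status.FromError(err); ok && s.Code() == codes.PermissionDenied {
		// Only the code and message travel with the status; there is no
		// structured payload (stack trace, redactable string, etc.).
		fmt.Println("permission denied:", s.Message())
	}
}
```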
OK, so I'm realizing that my previous comment was confusing. What I meant in that comment is that we could consider having an Error protobuf payload to report KV-level errors, in addition to reporting authentication/authz failures using gRPC statuses. We will need both eventually. This means we will see something like this in the code for every tenant connector RPC in the future:

```go
if err != nil {
	if grpcutil.IsAuthError(err) {
		// gRPC-level authentication or authorization error. Propagate.
		return err
	}
	// Soft RPC error. Drop client and retry.
	c.tryForgetClient(ctx, client)
	continue
}
if resp.Error != nil {
	// Hard logical error. Propagate.
	return resp.Error.GoError()
}
```
I created a new issue for connector code improvement so that this issue can be closed as a ga-blocker: #99144
98640: server: add `node_id` label to _status/vars output r=aadityasondhi a=dhartunian

Previously, the output of the Prometheus metrics via `_status/vars` did not include any node labels. This caused challenges for customers who want to monitor large clusters, as it requires additional configuration on the scrape side to ensure a node ID is added to the metrics. This can be challenging to deal with when nodes come and go in a cluster and the scrape configuration must change as well.

This change adds a `node_id` Prometheus label to the metrics we output that matches the current node's ID. Since `_status/vars` is output from a single node, there is only ever one value that's appropriate here. Secondary tenants will mark their metrics with either the nodeID of the shared-process system tenant, or the instanceID of the tenant process.

Resolves: #94763
Epic: None
Release note (ops change): Prometheus metrics available at the `_status/vars` path now contain a `node_id` label that identifies the node they were scraped from.

99143: multitenant: NewIterator connector infinite retry loop r=stevendanna a=ecwall

Fixes #98822

This change fixes an infinite retry loop in `Connector.NewIterator` that would occur when the `GetRangeDescriptors` stream returned an auth error. An example trigger would be passing in a span that was outside of the calling tenant's keyspace. Now `NewIterator` correctly propagates auth errors up to the caller.

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
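As a side note on the first PR above: with the standard Prometheus Go client, a constant per-process label can be attached by wrapping the registerer. The sketch below only illustrates that general technique; it is not CockroachDB's actual metric code, and the metric name and node ID are invented:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// Every metric registered through the wrapped registerer carries a
	// constant node_id label, similar in spirit to the change described above.
	wrapped := prometheus.WrapRegistererWith(prometheus.Labels{"node_id": "1"}, reg)

	queries := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_queries_total", // hypothetical metric name
		Help: "Number of queries served.",
	})
	wrapped.MustRegister(queries)
	queries.Inc()

	// A scrape of /metrics then reports example_queries_total{node_id="1"} 1.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":8080", nil)
}
```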
While reviewing #97537, an existing problem was encountered:
Symptom
Passing a key that is outside of a secondary tenant's keyspace into the new `crdb_internal.ranges_in_span` built-in results in an infinite retry loop. In the CLI this hangs until the query is canceled.
RCA
This outer loop is the one that is retried infinitely:

cockroach/pkg/ccl/kvccl/kvtenantccl/connector.go, line 511 in ce82ded

`stream.Recv()` returns an error that is not `io.EOF`, which results in this break:

cockroach/pkg/ccl/kvccl/kvtenantccl/connector.go, line 543 in ce82ded
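A rough sketch of that structure follows (helper and type names are assumed for illustration; the real code is at the lines linked above): the inner `Recv` loop exits on any error, and unless that error is `io.EOF` the outer loop simply drops the client and retries, so a permanent auth error spins forever.

```go
// Illustrative shape only; names are assumed, not copied from connector.go.
for ctx.Err() == nil { // outer loop: retried without limit
	client := c.getClient(ctx) // assumed helper
	stream, err := client.GetRangeDescriptors(ctx, req)
	if err != nil {
		c.tryForgetClient(ctx, client)
		continue
	}
	var descriptors []roachpb.RangeDescriptor
	var recvErr error
	for recvErr == nil {
		var resp *GetRangeDescriptorsResponse // response type name assumed
		resp, recvErr = stream.Recv()
		if recvErr == nil {
			descriptors = append(descriptors, resp.RangeDescriptors...)
		}
	}
	if recvErr == io.EOF {
		return newIterator(descriptors), nil // assumed constructor
	}
	// Any other error from stream.Recv(), including the auth error for a key
	// outside the tenant keyspace, lands here; the client is dropped and the
	// outer loop retries forever.
	c.tryForgetClient(ctx, client)
}
```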
Possible fix
Categorize errors returned by `stream.Recv()` and return the error from `Connector.NewIterator` if it is a non-retryable client error like this one.

Jira issue: CRDB-25537
Epic: CRDB-23344