Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multitenant: key out of tenant keyspace bounds - infinite retry loop #98822

Closed
ecwall opened this issue Mar 16, 2023 · 9 comments · Fixed by #99143
Closed

multitenant: key out of tenant keyspace bounds - infinite retry loop #98822

ecwall opened this issue Mar 16, 2023 · 9 comments · Fixed by #99143
Assignees
Labels
A-multitenancy Related to multi-tenancy branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ecwall
Copy link
Contributor

ecwall commented Mar 16, 2023

While reviewing #97537 an existing problem was encountered:

Symptom
Passing in a key that is outside of a secondary tenant's keyspace into the new crdb_internal.ranges_in_span built-in results in an infinite retry loop:

SELECT * FROM crdb_internal.ranges_in_span('\x8b', '\x8c'); 

In the CLI this hangs until the query is canceled:

[email protected]:26257/movr> SELECT * FROM crdb_internal.ranges_in_span('\x8b', '\x8c');                                                                                                                                                                       
^C
attempting to cancel query...
ERROR: query execution canceled
SQLSTATE: 57014

RCA
This outer loop is the one that is retried infinitely:

for ctx.Err() == nil {

stream.Recv() returns an error that is not io.EOF:
image
which results in this break:


Possible fix
Categorize errors returned by stream.Recv() and return that error from Connector.NewIterator if it is a non-retryable client error like this.


Jira issue: CRDB-25537

Epic CRDB-23344

@ecwall ecwall added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Mar 16, 2023
@ecwall ecwall self-assigned this Mar 16, 2023
@knz
Copy link
Contributor

knz commented Mar 17, 2023

Excellent issue description.

I am seeing the following code in the TokenBucket API handler just below:

      log.Warningf(⋄, "error issuing TokenBucket RPC: %v", err)
      if grpcutil.IsAuthError(err) {
        // Authentication or authorization error. Propagate.
        return nil, err
      }

could we solve our problem by doing the same in NewIterator?

And while we're at it, do it in the other for loops as well.

@knz knz added the GA-blocker label Mar 17, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 17, 2023

Hi @knz, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@knz knz added branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 17, 2023
@knz
Copy link
Contributor

knz commented Mar 17, 2023

We should also consider extending the response type protobuf inside NewIterator (and the other API handlers that have a similar structure) to use an Error payload to stop the loop, the same way as done in RangeLookup.

@ecwall
Copy link
Contributor Author

ecwall commented Mar 20, 2023

We should also consider extending the response type protobuf inside NewIterator (and the other API handlers that have a similar structure) to use an Error payload to stop the loop, the same way as done in RangeLookup.

Do you have insight into why RangeLookup is designed this way? It is not clear from this API why some errors are returned as nested errors inside the "success" response (resp.Error) and some errors are top level errors (err).

resp, err := client.RangeLookup(ctx, &kvpb.RangeLookupRequest{
	...
})
if err != nil {
	log.Warningf(ctx, "error issuing RangeLookup RPC: %v", err)
	if grpcutil.IsAuthError(err) {
		// Authentication or authorization error. Propagate.
		return nil, nil, err
	}
	// Soft RPC error. Drop client and retry.
	c.tryForgetClient(ctx, client)
	continue
}
if resp.Error != nil {
	// Hard logical error. Propagate.
	return nil, nil, resp.Error.GoError()
}

Also the response being "success" vs "error" seems to be enforced as a comment and not as a type union.

message RangeLookupResponse {
  repeated RangeDescriptor descriptors            = 1 [(gogoproto.nullable) = false];
  repeated RangeDescriptor prefetched_descriptors = 2 [(gogoproto.nullable) = false];
  // If non-nil, the other fields will be empty.
  kv.kvpb.Error error = 3;
}

If we are already forced to handle top-level errors and can use grpcutil.IsAuthError to distinguish between retriable and non-retriable, then what is the purpose of adding nested errors in the success response?

@knz
Copy link
Contributor

knz commented Mar 20, 2023

The Error payload inside the protobuf is an applicaiton-level error, after authentication/authorization has succeeded. It can contain details about the application.

The "outer" gRPC error reflects errors happening in the communication between client and server, and can be reported before authentication/authorization and thus must not contain application details.

@ecwall
Copy link
Contributor Author

ecwall commented Mar 20, 2023

It looks like application-level errors are supported in gRPC status codes - https://grpc.github.io/grpc/core/md_doc_statuscodes.html

I think the tenant trying to access keys outside of its keyspace would be OUT_OF_RANGE.

Is there a reason to not use a status code like this?

@knz
Copy link
Contributor

knz commented Mar 20, 2023

Is there a reason to not use a status code like this?

Yes - grpc status codes/errors can't have a payload, so we don't get things like stack traces, redactable strings etc.

(We have a separate project to extend grpc error reporting to use our github.com/cockroachdb/errors protobuf to carry this data, but it's not done yet)

@knz
Copy link
Contributor

knz commented Mar 20, 2023

Ok so I'm realizing that my previous comment was confusing.

What I meant in that comment is that we could consider having an Error protobuf payload to report KV-level errors in addition to reporting authentication/authz failures using gRPC statuses. We will need both eventually.

This means we will see something like this in the code for every tenant connector RPC in the future:

    if err != nil {
      if grpcutil.IsAuthError(err) {
        // gRPC-level Authentication or authorization error.  Propagate.
        return err
      }
      // Soft RPC error. Drop client and retry.
      c.tryForgetClient(ctx, client)
      continue
    }
    if resp.Error != nil {
      // Hard logical error. Propagate.
      return resp.Error.GoError()
    }

@ecwall
Copy link
Contributor Author

ecwall commented Mar 21, 2023

I created a new issue for connector code improvement so that this issue can be closed as a ga-blocker: #99144

craig bot pushed a commit that referenced this issue Mar 22, 2023
98640: server: add `node_id` label to _status/vars output r=aadityasondhi a=dhartunian

Previously, the output of the prometheus metrics via `_status/ vars` did not include any node labels. This caused challenges for customers who want to monitor large clusters as it requires additional configuration on the scrape- side to ensure a node ID is added to the metrics. This can be challenging to deal with when nodes come and go in a cluster and the scrape configuration must change as well.

This change adds a `node_id` prometheus label to the metrics we output that matches the current node's ID. Since `_status/vars` is output from a single node there is only ever one single value that's appropriate here.

Secondary tenants will mark their metrics with either the nodeID of the shared- process system tenant, or the instanceID of the tenant process.

Resolves: #94763
Epic: None

Release note (ops change): Prometheus metrics available at the `_status/vars` path now contain a `node_id` label that identifies the node they were scraped from.

99143: multitenant: NewIterator connector infinite retry loop r=stevendanna a=ecwall

Fixes #98822

This change fixes an infinite retry loop in `Connector.NewIterator` that would occur when the `GetRangeDescriptors` stream returned an auth error. An example trigger would be passing in a span that was outside of the calling tenant's keyspace.

Now `NewIterator` correctly propagates auth errors up to the caller.

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
@craig craig bot closed this as completed in f3b8baf Mar 22, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 22, 2023
Fixes #98822

This change fixes an infinite retry loop in `Connector.NewIterator` that would
occur when the `GetRangeDescriptors` stream returned an auth error. An example
trigger would be passing in a span that was outside of the calling tenant's
keyspace.

Now `NewIterator` correctly propagates auth errors up to the caller.

Release note: None
xinhaoz pushed a commit to xinhaoz/cockroach that referenced this issue Mar 30, 2023
Fixes cockroachdb#98822

This change fixes an infinite retry loop in `Connector.NewIterator` that would
occur when the `GetRangeDescriptors` stream returned an auth error. An example
trigger would be passing in a span that was outside of the calling tenant's
keyspace.

Now `NewIterator` correctly propagates auth errors up to the caller.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multitenancy Related to multi-tenancy branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants