
sql: wrap stacktraceless errors with errors.Wrap #96659

Merged
merged 1 commit into cockroachdb:master on Feb 8, 2023
Conversation

ecwall
Contributor

@ecwall ecwall commented Feb 6, 2023

Fixes #95794

This replaces the previous attempt to add logging here #95797.

The context itself cannot be augmented to add a stack trace to errors because
it interferes with gRPC timeout logic: gRPC compares errors directly without
checking causes (https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833).
Although the method signature allows it, `Context.Err()` should not be
overridden to customize the error:
```
// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.
Err() error
```

Additionally, a child context of the augmented context may end up being used
which will circumvent the stack trace capture.

This change instead adds errors.Wrap in a few places that might end up
helping debug the original problem:

  1. Where we call Context.Err() directly.
  2. Where gRPC returns an error after possibly calling Context.Err()
    internally or returns an error that does not have a stack trace.

Release note: None

@ecwall ecwall requested a review from rafiss February 6, 2023 17:37
@ecwall ecwall requested review from a team as code owners February 6, 2023 17:37
@blathers-crl

blathers-crl bot commented Feb 6, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


-- commits line 8 at r1:
nit: talk primarily about the contract of Context.Err().


-- commits line 11 at r1:
Can you try to also put messages on many of the points you touch? (i.e. errors.Wrap{f}()).


pkg/roachpb/errors.go line 323 at r1 (raw file):

				// TODO(tbg): revisit this unintuitive error wrapping here and see if
				// a better solution can be found.
				return errors.WithStack(&UnhandledRetryableError{

I think the changes to this function are dubious. We're not interested in the stack trace of the caller to GoError.
I think you want to capture a stack trace when this Error is returned from a Batch RPC. In fact, in the interesting case of a timeout error returned from client-side gRPC, I think you'll already have a better stack trace in the error given your changes to TracingInternalClient (I think DistSender takes the error returned by the RPC (and now modified in TracingInternalClient) and turns it into a roachpb.Error using roachpb.NewError(err)), and in that case adding another one here I think can only hurt.
I'd try to capture the backtrace around grpcTransport (and overwrite EncodedError). Btw, I think including the request in the error (ba.String()) would also be useful.

If you're not attracted to doing that, I'd suggest we don't touch this code at all, as it's not relevant for cancelation or timeout.


pkg/rpc/nodedialer/nodedialer.go line 321 at r2 (raw file):

	resp, err := tic.InternalClient.Batch(ctx, ba, opts...)
	if err != nil {
		return nil, errors.WithStack(err)

If we do something for the BatchResponse in grpcTransport like I suggest elsewhere, I think maybe wrapping this err should also move there for symmetry. In fact even without those other changes, I think this TracingInternalClient would be better left alone to deal with one thing only, and lift the error mangling up.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 97 at r1 (raw file):

		if methodExcludedFromTracing(info.FullMethod) {
			resp, err := handler(ctx, req)
			return resp, errors.WithStack(err)

The server-side stack that you're capturing here will always look the same (at least when the internalClientAdapter is not used)... Also I don't think it will make it across gRPC; in fact I suspect you might break the grpc return codes.
I don't think you want to capture it here.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 229 at r2 (raw file):

		_, localRequest := grpcutil.IsLocalRequestContext(ctx)
		if localRequest {
			return errors.WithStack(invoker(ctx, method, req, resp, cc, opts...))

I wouldn't put more logic into this tracing interceptor. I'd rather we add another interceptor around here.
Is there a high bar to doing that for streaming RPCs in particular? If so, I'd leave those alone, without the error wrapping.

@ecwall ecwall changed the title sql: wrap errors with errors.WithStack sql: wrap stacktraceless errors with errors.Wrap Feb 6, 2023
@ecwall ecwall requested a review from andreimatei February 6, 2023 22:42
Contributor Author

@ecwall ecwall left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @rafiss)


-- commits line 8 at r1:

Previously, andreimatei (Andrei Matei) wrote…

nit: talk primarily about the contract of Context.Err().

Added some more details.


-- commits line 11 at r1:

Previously, andreimatei (Andrei Matei) wrote…

Can you try to also put messages on many of the points you touch? (i.e. errors.Wrap{f}()).

Yeah I changed WithStack -> Wrap.


pkg/rpc/nodedialer/nodedialer.go line 321 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If we do something for the BatchResponse in grpcTransport like I suggest elsewhere, I think maybe wrapping this err should also move there for symmetry. In fact even without those other changes, I think this TracingInternalClient would be better left alone to deal with one thing only, and lift the error mangling up.

Reverted this back.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 97 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The server-side stack that you're capturing here will always look the same (at least when the internalClientAdapter is not used)... Also I don't think it will make it across gRPC; in fact I suspect you might break the grpc return codes.
I don't think you want to capture it here.

Removed changes in the interceptors.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 229 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wouldn't put more logic into this tracing interceptor. I'd rather we add another interceptor around here.
Is there a high bar to doing that for streaming RPCs in particular? If so, I'd leave those alone, without the error wrapping.

I moved the changes here to the spot you linked to and conditionally wrap if it is not a status error to avoid interfering with the caller. I am not sure if we need this or not... can stacktraceless errors come from here?


pkg/roachpb/errors.go line 323 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the changes to this function are dubious. We're not interested in the stack trace of the caller to GoError.
I think you want to capture a stack trace when this Error is returned from a Batch RPC. In fact, in the interesting case of a timeout error returned from client-side gRPC, I think you'll already have a better stack trace in the error given your changes to TracingInternalClient (I think DistSender takes the error returned by the RPC (and now modified in TracingInternalClient) and turns it into a roachpb.Error using roachpb.NewError(err)), and in that case adding another one here I think can only hurt.
I'd try to capture the backtrace around grpcTransport (and overwrite EncodedError). Btw, I think including the request in the error (ba.String()) would also be useful.

If you're not attracted to doing that, I'd suggest we don't touch this code at all, as it's not relevant for cancelation or timeout.

Reverted this back.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


-- commits line 11 at r1:

Previously, ecwall (Evan Wall) wrote…

Yeah I changed WithStack -> Wrap.

:lgtm:

pkg/ccl/kvccl/kvtenantccl/connector.go line 464 at r6 (raw file):

		return resp.Descriptors, resp.PrefetchedDescriptors, nil
	}
	return nil, nil, errors.Wrap(ctx.Err(), "range lookup context error")

nit: consider scratching "context error" from everywhere; I think the message of the two context errors is very recognizable


pkg/kv/kvclient/kvcoord/transport.go line 238 at r6 (raw file):

		span.ImportRemoteRecording(reply.CollectedSpans)
	}
	return reply, errors.Wrapf(err, "send batch error %s", ba.String())

Let's do the ba.String() only if err != nil; it does cost.
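The cost exists because Go evaluates call arguments eagerly: the formatting argument is computed even when the error is nil and the wrap is a no-op. A minimal sketch of why the guard helps; `expensive` and `wrapf` are hypothetical stand-ins for `ba.String()` and `errors.Wrapf`, not the PR's code.

```go
package main

import "fmt"

var calls int

// expensive stands in for ba.String(), which serializes the whole batch.
func expensive() string {
	calls++
	return "AdminSplit [/Min,/Max)"
}

// wrapf mirrors the nil-passthrough of errors.Wrapf.
func wrapf(err error, format string, args ...interface{}) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf(format+": %w", append(args, err)...)
}

func main() {
	// Unguarded: expensive() runs even though err is nil, because Go
	// evaluates arguments before the call.
	_ = wrapf(nil, "ba: %s RPC error", expensive())
	fmt.Println(calls) // 1

	// Guarded: the formatting cost is paid only on the error path.
	var err error
	if err != nil {
		err = wrapf(err, "ba: %s RPC error", expensive())
	}
	fmt.Println(calls) // still 1
}
```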


pkg/kv/kvclient/kvcoord/transport.go line 238 at r6 (raw file):

		span.ImportRemoteRecording(reply.CollectedSpans)
	}
	return reply, errors.Wrapf(err, "send batch error %s", ba.String())

let's try to move the ba before the word "error", otherwise it'll be confusing where the error message starts.
Maybe `ba: %s RPC error`.


pkg/rpc/context.go line 674 at r6 (raw file):

				if !ok {
					// Wrapping status errors interferes with caller.
					err = errors.Wrap(err, "client interceptor error")

this message is confusing; the error is not coming from "client interceptor"; it's coming from an RPC.


pkg/rpc/context.go line 682 at r6 (raw file):

		rpcCtx.clientStreamInterceptors = append(
			rpcCtx.clientStreamInterceptors,
			func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {

If it was me, I'd leave streaming RPCs alone. Usually the timeouts around them are encountered when reading or writing to the stream, not when initiating the call (i.e. here). You don't go through the trouble of providing a custom stream implementation that deals with that, so I think adding this code here doesn't serve much.


pkg/server/init.go line 480 at r6 (raw file):

		status, ok := grpcstatus.FromError(errors.UnwrapAll(err))
		if !ok {
			return nil, errors.Wrap(err, "non-grpc join error")

what is the "non-grpc" trying to tell me?
Maybe "failed to join cluster".


pkg/sql/internal_result_channel.go line 114 at r6 (raw file):

	ctx context.Context,
) (_ ieIteratorResult, done bool, err error) {
	const wrapMsg = "first result context error"

maybe "failed to read query result"


pkg/sql/internal_result_channel.go line 132 at r6 (raw file):

		return false, nil
	}
	const wrapMsg = "maybe unblock writer context error"

Up to you, but I feel like the amount of changes has gone overboard for little chance of actually helping anybody.
There's also some readability cost to lines like

	case <-i.doneCh:
		return true, errors.Wrap(ctx.Err(), wrapMsg)

Before, it was clear enough that the error returned could be nil; now that's less visible.

A lot of these call sites deal mostly with cancelations, not with timeouts. And for cancelation, I think the Cause(ctx) is what we want (once we also start using WithCancelCause).

So if it was me, I wouldn't blindly wrap every ctx.Err() that I can find.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 229 at r2 (raw file):

I am not sure if we need this or not... can stacktraceless errors come from here?

Sorry - what's "here"?


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 253 at r6 (raw file):

				clientSpan.Recordf("error: %s", err)
			}
			return err

if you're taking the blame here, move this return into the `if` above.

Contributor Author

@ecwall ecwall left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @rafiss)


pkg/ccl/kvccl/kvtenantccl/connector.go line 464 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: consider scratching "context error" from everywhere; I think the message of the two context errors is very recognizable

Removed


pkg/kv/kvclient/kvcoord/transport.go line 238 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Let's do the ba.String() only if err != nil; it does cost.

Done


pkg/kv/kvclient/kvcoord/transport.go line 238 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's try to move the ba before the word "error", otherwise it'll be confusing where the error message starts.
Maybe ba: %s RPC error.

Done


pkg/rpc/context.go line 674 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this message is confusing; the error is not coming from "client interceptor"; it's coming from an RPC.

Removed this based on the comment below


pkg/rpc/context.go line 682 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If it was me, I'd leave streaming RPCs alone. Usually the timeouts around them are encountered when reading or writing to the stream, not when initiating the call (i.e. here). You don't go through the trouble of providing a custom stream implementation that deals with that, so I think adding this code here doesn't serve much.

Removed these impls


pkg/server/init.go line 480 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what is the "non-grpc" trying to tell me?
Maybe "failed to join cluster".

Done


pkg/sql/internal_result_channel.go line 114 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe "failed to read query result"

Done


pkg/sql/internal_result_channel.go line 132 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Up to you, but I feel like the amount of changes has gone overboard for little chance of actually helping anybody.
There's also some readability cost to lines like

	case <-i.doneCh:
		return true, errors.Wrap(ctx.Err(), wrapMsg)

Before, it was clear enough that the error returned could be nil; now that's less visible.

A lot of these call sites deal mostly with cancelations, not with timeouts. And for cancelation, I think the Cause(ctx) is what we want (once we also start using WithCancelCause).

So if it was me, I wouldn't blindly wrap every ctx.Err() that I can find.

The reason I added these is that it looks like any of these could be where ctx.Err() is first discovered when using the InternalExecutor. I added a comment about errors.Wrap returning nil if nil is passed in.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 229 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I am not sure if we need this or not... can stacktraceless errors come from here?

Sorry - what's "here"?

I removed this anyways, but I meant from inside invoker.


pkg/util/tracing/grpcinterceptor/grpc_interceptor.go line 253 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if you're taking the blame here, move this return in the if above`.

Done

Contributor

@andreimatei andreimatei left a comment


Merge away as far as I'm concerned.
I don't know about backporting, though. I'd rather we just revert the previous change on the other branches. Do we really need to backport anything?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ecwall and @rafiss)

Contributor Author

@ecwall ecwall left a comment


We already reverted the previous change in 22.1.x and 22.2.x.

These logging changes were driven by context.DeadlineExceeded errors that a customer was hitting in 22.1.x, whose source we could not identify because there was no stack trace: https://github.com/cockroachlabs/support/issues/2017

I think this PR should be backported otherwise 22.1.x and 22.2.x won't have stack traces for these errors.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @rafiss)

@ecwall
Contributor Author

ecwall commented Feb 8, 2023

bors r=andreimatei

@craig
Contributor

craig bot commented Feb 8, 2023

Build succeeded:

@craig craig bot merged commit 844a370 into cockroachdb:master Feb 8, 2023
@blathers-crl

blathers-crl bot commented Feb 8, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 00fd3ef to blathers/backport-release-22.1-96659: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 00fd3ef to blathers/backport-release-22.2-96659: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.



Collaborator

@arulajmani arulajmani left a comment


Up to you, but I feel like the amount of changes has gone overboard for little chance of actually helping anybody.

@andreimatei I see you write this below, and I'd like to talk about this a bit more. This PR skewed with some datadriven tests I'm writing and the way these errors look now is a bit verbose. Consider what I had to update my test to:

```
exec-sql-tenant
CREATE TABLE t(a INT)
----
ok

exec-privileged-op-tenant
ALTER TABLE t SPLIT AT VALUES (0)
----
pq: ba: AdminSplit [/Tenant/10/Table/104/1/0,/Min) RPC error: rpc error: code = Unauthenticated desc = tenant 10 does not have admin split capability
```

Do we absolutely need this change? Could this "original problem" the commit message references be instead solved by verbose logging, and selectively turning it on, or the like?

I'm worried adding such intense wrapping on so many error paths doesn't really help anyone in the long run.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Successfully merging this pull request may close these issues.

sql: improve stack trace for get-user-timeout timeouts