
kvclient: switch error in ctx cancel edge case #70159

Open
andreimatei wants to merge 1 commit into master from small.DistSender-cancel-err
Conversation

@andreimatei (Contributor)

This patch changes the error returned by the DistSender on a cancelled
ctx. Depending on exactly who detects the ctx as cancelled, there are a
number of different possibilities - too many to enumerate here.
Generally, the client is supposed to get a context.Canceled error,
wrapped in different layers. The code path that this patch changes is
the one where the cancellation is detected by sendPartialBatch() without
it having been previously detected by sendToReplicas(). This path is
unusual (sendToReplicas() generally detects the cancellation first).
Depending on the exact cancellation timing, though, it is possible for
sendPartialBatch() to detect it instead. In that case, this patch makes
it so that, if sendToReplicas() returned a sendError (indicating that a
suitable replica could not be reached and that the higher layer is
expected to continue trying other replicas), the error returned to the
client is a cancellation error instead of the sendError.

Touches #69419

Release note: None
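
To make the shape of the change concrete, here is a small, self-contained Go sketch of the behavior described above. It is not the actual DistSender code; sendWithRetries, attempt, and errSend are made-up names for illustration. The only point is which error gets surfaced once the retry loop stops because the ctx was canceled.

	package main

	import (
		"context"
		"errors"
		"fmt"
		"time"
	)

	// errSend stands in for the DistSender's sendError: "no suitable replica
	// could be reached; the caller is expected to keep trying".
	var errSend = errors.New("sendError")

	// sendWithRetries mimics the structure under discussion: retry on a
	// sendError, stop attempting once the ctx is canceled, then decide which
	// error to return to the client.
	func sendWithRetries(ctx context.Context, attempt func() error) error {
		var lastErr error
		for ctx.Err() == nil { // no further attempts once the ctx is canceled
			if lastErr = attempt(); lastErr == nil {
				return nil
			}
			if !errors.Is(lastErr, errSend) {
				return lastErr // a structured error from below is returned as-is
			}
			time.Sleep(time.Millisecond) // stand-in for the retry backoff
		}
		// The loop exited because the ctx was canceled. The patched behavior:
		// prefer the cancellation error over the stale sendError from the
		// last attempt.
		if err := ctx.Err(); err != nil {
			return err
		}
		return lastErr
	}

	func main() {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
		defer cancel()
		err := sendWithRetries(ctx, func() error { return errSend })
		fmt.Println(err) // context deadline exceeded, not "sendError"
	}

Before this patch, the analogue of this function returned lastErr (the sendError) in that situation; the patch flips the preference to the cancellation error.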

@andreimatei requested a review from a team as a code owner on September 13, 2021 21:28
@cockroach-teamcity (Member)

This change is Reviewable

@tbg (Member) commented Sep 14, 2021

I've read the PR message twice, stared at the diff twice, and gone back and forth between them. I still cannot figure out what difference this diff actually makes. Could you elucidate a bit more?

@tbg (Member) commented Sep 14, 2021

The idea is that if we get a sendError and the context is canceled, we need to hit the deduceRetryEarlyExitError line, right? So we populate pErr later. But if we hit a send error, don't we continue in the retry loop, i.e. we'll try to send again (in the common case) and then get a context canceled error back from that directly? So does it then only matter in the case in which this was the last retry? That's my understanding so far. Feels like there is some more commentary to be added to the diff, to preserve these nuances for readers to come.

@andreimatei (Contributor, Author) left a comment

But if we hit a send error, don't we continue in the retry loop, i.e. we'll try to send again (in the common case) and then get a context canceled error back from that directly?

No. I think you're missing the fact that sendPartialBatch uses retry.StartWithCtx, so it won't retry once the ctx is canceled.

The scenario the patch changes is about sendToReplicas returning a sendError, and the ctx also being canceled at that point. In that case, what we want is to take the deduceRetryEarlyExitError path. Before this patch, that path was not taken - the assignment of the pErr that this patch moves was inhibiting it. With the patch, we're no longer inhibiting it. Makes sense? If so, I'll reword the commit msg somehow.

This patch was prompted by some roachtest failures that Yahor was working on. Something in SQL was expecting a ctx canceled error, and it wasn't getting it (it was getting a sendError instead). It's not easy to trigger the case that the patch is addressing because sendToReplicas generally doesn't return a sendError if the ctx is canceled while sendToReplicas is running. I forgot if I convinced myself that that can, in fact, happen, or whether I believe that the only case where this patch makes a difference is when the ctx is canceled exactly as sendToReplicas is returning. Yahor's TPCC surprisingly seemed able to produce this scenario somewhat reliably.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker (Contributor) left a comment

I wonder if it might be better to keep the current structure, but instead override pErr if the context was cancelled, thereby assuming that the context cancellation caused the error. E.g.:

	// Propagate error if either the retry closer or context done
	// channels were closed.
	if err := ds.deduceRetryEarlyExitError(ctx); err != nil {
		pErr = roachpb.NewError(err)
	} else if pErr == nil {
		log.Fatal(ctx, "exited retry loop without an error")
	}

At least to me, this expresses the intent better than the proposed patch: there was an error, but if the context was cancelled then that's the error we care about, regardless of how it manifested further down the stack.

Also, won't this patch end up with pErr = nil if the retries are exhausted and all of the errors were sendError? This will currently trigger the log.Fatal(), which would be bad.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@andreimatei force-pushed the small.DistSender-cancel-err branch from 9ddcc52 to 9e34fe0 on September 24, 2021 15:25
@andreimatei (Contributor, Author) left a comment

but instead override pErr if the context was cancelled, thereby assuming that the context cancellation caused the error

I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an AmbiguousResultError.
Btw, I've also convinced the SQL module that caused this hubbub to not make assumptions about what errors it might get after a canceled ctx.

Also, won't this patch end up with pErr = nil if the retries are exhausted and all of the errors were sendError? This will currently trigger the log.Fatal(), which would be bad.

Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the DistSender, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.
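
For reference, a sketch of what that change looks like, in the shape of the snippet quoted above (assuming the errors.AssertionFailedf helper from github.com/cockroachdb/errors; this is illustrative, not the actual diff):

	// Propagate error if either the retry closer or context done
	// channels were closed.
	if err := ds.deduceRetryEarlyExitError(ctx); err != nil {
		pErr = roachpb.NewError(err)
	} else if pErr == nil {
		// Surface a loud but recoverable error instead of crashing the node.
		pErr = roachpb.NewError(errors.AssertionFailedf(
			"exited retry loop without an error"))
	}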
Btw, I should say that the current nested structure of the DistSender loops, with retries at multiple levels, is far from ideal. The code kept evolving and, as our descriptor cache got more sophisticated, the separation between sendToReplicas and sendPartialBatch no longer makes much sense. One day I'd like to unify them.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@erikgrinaker (Contributor) left a comment

I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an AmbiguousResultError.

I'm not sure this matters. Whether the error happened before or after the context cancellation is philosophically meaningless when the context is cancelled before the response is returned, since the caller no longer cares about it at that point (that's what it means to cancel the context). Also, someone elsewhere in the stack may well detect the context cancellation error first and decide to return that instead (consider e.g. a non-deterministic multi-channel select).
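
A self-contained illustration of that last point (made-up names, nothing CockroachDB-specific): when both the operation's own error and the cancellation are ready, a Go select picks between them pseudo-randomly, so the caller can see either one on any given run.

	package main

	import (
		"context"
		"errors"
		"fmt"
	)

	func main() {
		ctx, cancel := context.WithCancel(context.Background())
		cancel() // the ctx is already canceled

		errCh := make(chan error, 1)
		errCh <- errors.New("operation failed") // the operation's own error is also ready

		// Both cases are ready; Go chooses one pseudo-randomly.
		select {
		case err := <-errCh:
			fmt.Println("saw the operation error:", err)
		case <-ctx.Done():
			fmt.Println("saw the cancellation:", ctx.Err())
		}
	}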

Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the DistSender, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.

Got it, makes sense to let the context control the retries. I still think that implicit assumptions like these tend to turn the codebase into a bit of a minefield, so I'd rather have the code avoid it entirely, but I won't insist on it. Your call.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@andreimatei (Contributor, Author) left a comment

I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an AmbiguousResultError.

I'm not sure this matters. Whether the error happened before or after the context cancellation is philosophically meaningless when the context is cancelled before the response is returned, since the caller no longer cares about it at that point (that's what it means to cancel the context).

First of all, your statement as written is clearly not true in the success case. Canceling a ctx doesn't necessarily mean that the caller "no longer cares about the result". Frequently, the caller is still blocked on a result, and if the result were to be a success (i.e. if nobody cared about the ctx cancelation), the caller would much rather take a successful result than a context canceled error (particularly since the price for performing the operation has already been paid). In many cases, canceling a context simply means "hurry up". In the case of DistSQL, for example, we still expect a draining procedure to occur after a context was canceled. The "caller" is far from "not caring" about it.

Perhaps more subtly, I don't agree with what you're saying even in the error case. The caller may (and does!) care about whether an operation is known to have completed (with an error) versus some layer having given up waiting on the layer below because of a canceled ctx. If a non-idempotent operation is not known to have run to completion (and failed), then the caller needs to deal with the possibility that the operation has succeeded (or that it may succeed in the future). When it can be deduced that the operation completed, it's bad practice to overwrite that error with an ambiguous one.
The DistSender perfectly illustrates this point. Look at the handling of AmbiguousResultError. If sendPartialBatch were to override an error coming from below, it'd need to mark the error it returns as ambiguous.

As another note on the principle of the thing, keep in mind that, generally speaking, if you were to blindly overwrite other errors with context canceled errors and thus introduce new ambiguous errors, it wouldn't be just in rare cases that you'd turn a non-ambiguous error into an ambiguous one. You would, in fact, turn a non-ambiguous error into an ambiguous one whenever the layer below you doesn't listen for context cancellation. In those cases, by the time you've detected the ctx cancellation, you've already paid the price of executing the operation fully. To throw away the result of the operation for no good reason after you've paid dearly for it would be an insult. This is not just theoretical; it's practical in the case of RPCs going to the local node. For such RPCs, we don't have gRPC listening for ctx cancelation (we bypass gRPC), and so it's entirely up to the RPC handler to listen or not listen to ctx cancelation.
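
To put the local-RPC point in code (a toy, self-contained example; writeRow and caller are made up): the callee never looks at the ctx, so by the time the caller notices the cancellation the work already has a definite outcome, and replacing that outcome with ctx.Err() would make it look ambiguous.

	package main

	import (
		"context"
		"fmt"
	)

	// writeRow never consults the ctx; it always runs to completion, like an
	// RPC handler invoked locally without gRPC watching the ctx for us.
	func writeRow(v int) error {
		// ... performs the write unconditionally ...
		return nil // definite success
	}

	func caller(ctx context.Context) error {
		err := writeRow(42)
		// Tempting but wrong: overwriting a definite outcome with the ctx
		// error would force the caller to treat the write as possibly-applied.
		// if ctx.Err() != nil { return ctx.Err() }
		return err
	}

	func main() {
		ctx, cancel := context.WithCancel(context.Background())
		cancel() // canceled before we even start
		fmt.Println(caller(ctx)) // <nil>: the write is known to have happened
	}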

Also, someone elsewhere in the stack may well detect the context cancellation error first and decide to return that instead (consider e.g. a non-deterministic multi-channel select).

Someone up the stack may do whatever they want (as long as they deal with ambiguous vs non-ambiguous errors correctly); that's no reason for us at this lower level to not do the best we can. And in the particular case of BatchRequest RPCs, someone up the stack is not generally free to do whatever they want because the error code matters. Again, AmbiguousResultError.

Sorry for preaching.

Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the DistSender, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.
Got it, makes sense to let the context control the retries. I still think that implicit assumptions like these tend to turn the codebase into a bit of a minefield, so I'd rather have the code avoid it entirely, but I won't insist on it. Your call.

I wouldn't call it an "implicit assumption" (or at least not an unreasonable implicit assumption). Anybody who doesn't want infinite retries does special configuration to bound the retries. In any case, I think the new AssertionFailedError should satisfy you enough; nothing will blow up if the "implicit assumption" is violated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)

@andreimatei force-pushed the small.DistSender-cancel-err branch 2 times, most recently from 781c5f3 to 8ad62e6 on January 23, 2022 03:30
@andreimatei (Contributor, Author)

bors r+

@craig (bot) commented Jan 28, 2022

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Jan 28, 2022
70159: kvclient: switch error in ctx cancel edge case r=andreimatei a=andreimatei

@craig (bot) commented Jan 28, 2022

Build failed (retrying...):

@craig (bot) commented Jan 28, 2022

Build failed:

@andreimatei force-pushed the small.DistSender-cancel-err branch from 8ad62e6 to 86fb039 on January 29, 2022 00:49
This patch changes the error returned by the DistSender on a cancelled
ctx. Depending on exactly who detects the ctx as cancelled, there are a
number of different possibilities - too many to enumerate here.
Generally, the client is supposed to get a context.Canceled error,
wrapped in different layers. The code path that this patch changes is
the one where the cancellation is detected by sendPartialBatch() without
it having been previously detected by sendToReplicas(). This path is
unusual (sendToReplicas() generally detects the cancellation first). It's
also hard to test.
Depending on the exact cancellation timing, though, it is possible for
sendPartialBatch() to detect it instead. In that case, this patch makes
it so that, if sendToReplicas() returned a sendError (indicating that a
suitable replica could not be reached and that the higher layer is
expected to continue trying other replicas), the error returned to the
client is a cancellation error instead of the sendError.

Touches cockroachdb#69419

Release note: None
@andreimatei force-pushed the small.DistSender-cancel-err branch from 86fb039 to 2ee8a77 on January 29, 2022 01:28