kvclient: switch error in ctx cancel edge case #70159
Conversation
I've read the PR message twice, stared at the diff twice, and back. I still cannot figure out what difference this diff actually makes. Could you elucidate a bit more?
The idea is that if we get a `sendError` and the context is canceled, we need to hit the `deduceRetryEarlyExitError` path.
But if we hit a send error, don't we continue in the retry loop, i.e. we'll try to send again (in the common case) and then get a context canceled error back from that directly?
No. I think you're missing the fact that `sendPartialBatch` uses `retry.StartWithCtx`, so it won't retry when the ctx is canceled.
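For reference, here's a minimal sketch of that idiom (`attemptLoop` is a hypothetical helper, not code from this PR; it assumes the shape of cockroach's `pkg/util/retry` API): `r.Next()` returns false once the ctx is done, so no further attempt runs against a canceled ctx.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/retry"
)

// attemptLoop illustrates the retry.StartWithCtx idiom: r.Next() returns
// false once ctx is done, so no attempt runs against a canceled ctx.
func attemptLoop(ctx context.Context, opts retry.Options, attempt func(context.Context) error) error {
	var err error
	for r := retry.StartWithCtx(ctx, opts); r.Next(); {
		if err = attempt(ctx); err == nil {
			return nil
		}
	}
	// Reached when the retry budget is exhausted or the ctx was canceled.
	return err
}
```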
The scenario the patch changes is about `sendToReplicas` returning a `sendError` while the `ctx` is also canceled at that point. In that case, what we want is to take the `deduceRetryEarlyExitError` path. Before this patch, that path was not taken: the assignment of the `pErr` that this patch moves was inhibiting it. With the patch, we're no longer inhibiting it. Makes sense? If so, I'll reword the commit msg somehow.
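To make that inhibition concrete, here is a self-contained toy (all names are stand-ins; this is not the real `DistSender` code) contrasting latching the `sendError` inside the loop (pre-patch) with deferring to the post-loop cancellation check (post-patch):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errSend = errors.New("sendError: no usable replica")

// sendToReplicasStub fails with a sendError and cancels the ctx, mimicking
// the edge case where the ctx is canceled just as the send fails.
func sendToReplicasStub(cancel context.CancelFunc) error {
	cancel()
	return errSend
}

// sendPartialBatchStub sketches the control flow at issue. latchEarly
// mimics the pre-patch behavior of assigning pErr inside the retry loop.
func sendPartialBatchStub(ctx context.Context, cancel context.CancelFunc, latchEarly bool) error {
	var retErr error
	for ctx.Err() == nil { // stand-in for retry.StartWithCtx(...).Next()
		err := sendToReplicasStub(cancel)
		if err != nil {
			if latchEarly {
				retErr = err // pre-patch: sendError latched here
			}
			continue
		}
		return nil
	}
	// Post-loop: the deduceRetryEarlyExitError path. Pre-patch, a latched
	// sendError inhibited it.
	if retErr == nil && ctx.Err() != nil {
		retErr = ctx.Err()
	}
	return retErr
}

func main() {
	ctx1, cancel1 := context.WithCancel(context.Background())
	fmt.Println("pre-patch: ", sendPartialBatchStub(ctx1, cancel1, true)) // sendError wins
	ctx2, cancel2 := context.WithCancel(context.Background())
	fmt.Println("post-patch:", sendPartialBatchStub(ctx2, cancel2, false)) // context.Canceled wins
}
```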
This patch was prompted by some roachtest failures that Yahor was working on. Something in SQL was expecting a ctx canceled error and wasn't getting it (it was getting a `sendError` instead). It's not easy to trigger the case that the patch is addressing, because `sendToReplicas` generally doesn't return a `sendError` if the ctx is canceled while `sendToReplicas` is running. I forget whether I convinced myself that that can, in fact, happen, or whether I believe that the only case where this patch makes a difference is when the ctx is canceled exactly as `sendToReplicas` is returning. Yahor's TPCC run surprisingly seemed able to produce this scenario somewhat reliably.
I wonder if it might be better to keep the current structure, but instead override `pErr` if the context was cancelled, thereby assuming that the context cancellation caused the error. E.g.:
```go
// Propagate error if either the retry closer or context done
// channels were closed.
if err := ds.deduceRetryEarlyExitError(ctx); err != nil {
	pErr = roachpb.NewError(err)
} else if pErr == nil {
	log.Fatal(ctx, "exited retry loop without an error")
}
```
At least to me, this expresses the intent better than the proposed patch: there was an error, but if the context was cancelled then that's the error we care about, regardless of how it manifested further down the stack.
Also, won't this patch end up with `pErr == nil` if the retries are exhausted and all of the errors were `sendError`s? This would currently trigger the `log.Fatal()`, which would be bad.
(Force-pushed from 9ddcc52 to 9e34fe0.)
> but instead override `pErr` if the context was cancelled, thereby assuming that the context cancellation caused the error
I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an `AmbiguousResultError`.
Btw, I've also convinced the SQL module that caused this hubbub to not make assumptions about what errors it might get after a canceled ctx.
> Also, won't this patch end up with `pErr == nil` if the retries are exhausted and all of the errors were `sendError`s? This will currently trigger the `log.Fatal()`, which would be bad.
Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the `DistSender`, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.
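The fatal-to-assertion change, sketched (`finishRetryLoop` is a hypothetical helper showing the shape, not the exact diff; `errors` is `github.com/cockroachdb/errors`):

```go
// finishRetryLoop returns an assertion error instead of crashing the node
// when the retry loop exits without having produced an error.
func finishRetryLoop(pErr *roachpb.Error) *roachpb.Error {
	if pErr == nil {
		// Previously: log.Fatal(ctx, "exited retry loop without an error").
		pErr = roachpb.NewError(errors.AssertionFailedf(
			"exited retry loop without an error"))
	}
	return pErr
}
```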
Btw, I should say that the current nested structure of the DistSender loops, with retries at multiple levels, is far from ideal. The code kept evolving and, as our descriptor cache got more sophisticated, the separation between `sendToReplicas` and `sendPartialBatch` no longer makes much sense. One day I'd like to unify them.
> I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an `AmbiguousResultError`.
I'm not sure this matters. Whether the error happened before or after the context cancellation is philosophically meaningless when the context is cancelled before the response is returned, since the caller no longer cares about it at that point (that's what it means to cancel the context). Also, someone elsewhere in the stack may well detect the context cancellation first and decide to return that error instead (consider e.g. a non-deterministic multi-channel `select`).
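A toy illustration of that race (`awaitResult` is a hypothetical helper, not code from this PR): when both cases are ready, Go picks a `select` case pseudo-randomly, so the cancellation can win even though a concrete result was simultaneously available.

```go
// awaitResult returns whichever of the two outcomes wins the select race.
func awaitResult(ctx context.Context, resultCh <-chan error) error {
	select {
	case err := <-resultCh: // the operation's own outcome
		return err
	case <-ctx.Done(): // may be chosen even if resultCh is also ready
		return ctx.Err()
	}
}
```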
> Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the `DistSender`, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.
Got it, makes sense to let the context control the retries. I still think that implicit assumptions like these tend to turn the codebase into a bit of a minefield, so I'd rather have the code avoid it entirely, but I won't insist on it. Your call.
>> I don't want to do that; the simple fact that a context is canceled does not mean that the error you got has anything to do with the cancellation. Beyond the principle, in this particular case we don't want to overwrite an `AmbiguousResultError`.
> I'm not sure this matters. Whether the error happened before or after the context cancellation is philosophically meaningless when the context is cancelled before the response is returned, since the caller no longer cares about it at that point (that's what it means to cancel the context).
First of all, your statement as written is clearly not true in the success case. Canceling a ctx doesn't necessarily mean that the caller "no longer cares about the result". Frequently, the caller is still blocked on a result, and if the result were to be a success (i.e. if nobody cared about the ctx cancelation), the caller would much rather take a successful result than a context canceled error (particularly since the price for performing the operation has already been paid). In many cases, canceling a context simply means "hurry up". In the case of DistSQL, for example, we still expect a draining procedure to occur after a context was canceled. The "caller" is far from "not caring" about it.
Perhaps more subtly, I don't agree with what you're saying even in the error case. The caller may (and does!) care about whether an operation is known to have completed (with an error), versus whether some layer gave up waiting on the layer below because of a canceled ctx. If a non-idempotent operation is not known to have run to completion (and failed), then the caller needs to deal with the possibility that the operation has succeeded (or that it may succeed in the future). When it can be deduced that the operation completed, it's bad practice to overwrite that error with an ambiguous one.
The `DistSender` perfectly illustrates this point. Look at the handling of `AmbiguousResultError`. If `sendPartialBatch` were to override an error coming from below, it'd need to mark the error it returns as ambiguous.
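In other words, overriding would force something like the following (a hypothetical sketch; assumes `roachpb.NewAmbiguousResultErrorf` or an equivalent constructor):

```go
// overrideOnCancel: once we replace a lower-level error on ctx cancellation,
// we can no longer prove the operation didn't apply, so the replacement
// must itself be marked ambiguous.
func overrideOnCancel(ctx context.Context, pErr *roachpb.Error) *roachpb.Error {
	if ctx.Err() != nil {
		return roachpb.NewError(roachpb.NewAmbiguousResultErrorf(
			"context canceled during send: %v", ctx.Err()))
	}
	return pErr
}
```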
As another note on the principle of the thing, keep in mind that, generally speaking, if you were to blindly overwrite other errors with context canceled errors and thus introduce new ambiguous errors, it wouldn't be just in rare cases that you've turned a non-ambiguous error into an ambiguous one. You would, in fact, turn a non-ambiguous error into an ambiguous one whenever the layer below you doesn't listen for context cancellation. In these cases, by the time you've detected the ctx cancellation, you've already paid the price of executing the operation fully. To throw away the result of the operation for no good reason after you've paid dearly for it would be an insult. This is not just theoretical; it's practical in the cases of RPCs going to the local node. For such cases, we don't have gRPC listening for ctx cancelation (we bypass gRPC), and so it's entirely up to the RPC handler to listen or not listen to ctx cancelation.
> Also, someone elsewhere in the stack may well detect the context cancellation error first and decide to return that instead (consider e.g. a non-deterministic multi-channel `select`).
Someone up the stack may do whatever they want (as long as they deal with ambiguous vs. non-ambiguous errors correctly); that's no reason for us, at this lower level, not to do the best we can. And in the particular case of `BatchRequest` RPCs, someone up the stack is not generally free to do whatever they want, because the error code matters. Again, `AmbiguousResultError`.
Sorry for preaching.
>> Well, the retries are infinite; only the closer or the ctx end them. Relying on infinite retries is a pattern in various places in the code base. For the `DistSender`, the retries are technically configurable, so I'll give you partial points; I took the opportunity to turn the fatal into an assertion error.
> Got it, makes sense to let the context control the retries. I still think that implicit assumptions like these tend to turn the codebase into a bit of a minefield, so I'd rather have the code avoid it entirely, but I won't insist on it. Your call.
I wouldn't call it an "implicit assumption" (or at least not an unreasonable one). Anybody who doesn't want infinite retries does special configuration to bound the retries. In any case, I think the new `AssertionFailedError` should satisfy you enough; nothing will blow up if the "implicit assumption" is violated.
(Force-pushed from 781c5f3 to 8ad62e6.)
bors r+
Build failed (retrying...)
Build failed (retrying...)
Build failed
(Force-pushed from 8ad62e6 to 86fb039.)
(Force-pushed from 86fb039 to 2ee8a77.)
This patch changes the error returned by the DistSender on a cancelled
ctx. Depending on exactly who detects the ctx as cancelled, there
are a number of different possibilities - too many to enumerate here.
Generally, the client is supposed to get a context.Canceled error,
wrapped in different layers. The code path that this patch changes is
about the cancellation being detected by sendPartialBatch() without it
having been previously detected by sendToReplicas(). This path is
unusual (sendToReplicas() generally detects the ctx cancelled).
Depending on the exact cancellation timing, it's possible though for
sendPartialBatch() to detect it instead. In this case, this patch makes
it so that, if sendToReplicas() returned a sendError (indicating that
a suitable replica could not be reached and that the higher layer is
expected to continue trying other replicas), the error returned to the
client is a cancellation error instead of the sendError.
Touches #69419
Release note: None