roachtest: kv50/enc=false/nodes=4/cpu=96/batch=64 failed [SendError bubbles up] #69419
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:
I do wonder why this wasn't picked up by the dead node detector:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 6700cf65e82a029a1ff76a75250c709aa9637170:
After fixing the issue with not waiting for concurrent goroutines to exit in the vectorized engine, I now see the following when running this roachtest:
As a bit of background, the recently merged #68679 introduced parallelism for TableReaders when the flow is local, and the fact that this roachtest uses batches satisfies the requirements to trigger that parallelism. Under the hood, we use the parallel unordered synchronizer to run those TableReaders concurrently.
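To make the shape of that parallelism concrete, here is a minimal, self-contained Go sketch of the fan-in pattern (illustrative names only, not CockroachDB's actual TableReader or synchronizer code): several reader goroutines push results into one stream, and a waiter makes sure every goroutine has exited before the stream is closed, which is roughly the bookkeeping that later gets propagated to `FlowBase`.

```go
package main

import (
	"fmt"
	"sync"
)

// result stands in for a batch produced by one parallel reader.
type result struct {
	readerID int
	rows     int
}

// runParallelReaders fans several reader goroutines into a single channel and
// closes the channel only after every goroutine has exited, so a consumer that
// ranges over the channel implicitly waits for all of them.
func runParallelReaders(numReaders int) <-chan result {
	out := make(chan result)
	var wg sync.WaitGroup
	wg.Add(numReaders)
	for i := 0; i < numReaders; i++ {
		go func(id int) {
			defer wg.Done()
			// A real TableReader would scan its assigned key span here.
			out <- result{readerID: id, rows: 10 * (id + 1)}
		}(i)
	}
	go func() {
		wg.Wait() // don't close the stream until every reader goroutine is done
		close(out)
	}()
	return out
}

func main() {
	for r := range runParallelReaders(4) {
		fmt.Printf("reader %d produced %d rows\n", r.readerID, r.rows)
	}
}
```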
Hm, #69612 in combination with the following diff
makes the roachtest pass.
Never mind, two commits in #69612 seem to be sufficient.
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:
My current guess is that because we're canceling the context of …
68983: backupccl: stop including historical databases in cluster backup Descs r=adityamaru a=pbardea

A previous commit attempted to fix a bug where cluster backup would not include tables in dropped databases between incremental backups. That fix aimed to find dropped databases and add them to the set of descriptors. However, this causes issues when a database is recreated with the same name.

Rather than adding the dropped DBs to the Descriptors field on the backup manifest, this commit updates how DescriptorChanges are populated for cluster backups with revision history. Now, the initial scan of descriptors as of the start time will look for all descriptors in the cluster rather than just those that were resolved as of the end time of the backup.

Release note (bug fix): Fix a bug where cluster revision-history backups may have included dropped descriptors in the "current" snapshot of descriptors on the cluster.

Release justification: bug fix. Fix a bug where cluster revision-history backups may have included dropped descriptors in the "current" snapshot of descriptors on the cluster.

69378: bazel,ci: propagate `TEST_TMPDIR` down to go tests and capture artifacts r=jlinder a=rickystewart

This fulfills a long-standing request to capture left-over artifacts in `TMPDIR` (see #59045 (comment)). Bazel sets the `TEST_TMPDIR` environment variable for the temporary directory and expects all tests to write temporary files to that directory. In our Go tests, however, we consult the `TMPDIR` environment variable to find that directory. So we pull in a custom change to `rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use `/artifacts/tmp` as the `TEST_TMPDIR`.

Closes #59045.
Closes #69372.

Release justification: Non-production code change

Release note: None

69612: colflow: propagate concurrency info from vectorized to FlowBase r=yuzefovich a=yuzefovich

**colflow: propagate concurrency info from vectorized to FlowBase**

We've recently merged a change to introduce concurrency in the local flows. Those new concurrent goroutines are started by the vectorized parallel unordered synchronizer, and `FlowBase` isn't aware of them; as a result, `FlowBase.Wait` currently might not wait for all goroutines to exit (which is an optimization when there are no concurrent goroutines). This commit fixes the problem by propagating the information from the vectorized flow to the `FlowBase`.

Addresses: #69419.

Release note: None (no stable release with this bug)

Release justification: bug fix to new functionality.

**sql: loosen up the physical planning of parallel scans**

This commit makes it so that, in case we try to plan parallel TableReaders but encounter an error during planning, we fall back to having a single TableReader.

Release note: None

Release justification: bug fix to new functionality.

Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Adding this diff:
appears to make the roachtest pass, but it seems wrong. @andreimatei @nvanbenschoten, do you think it's possible that the cancellation of the context passed into …
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44c95054cfe0108df872d2cda7952fc92259426b:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ e369d86b3bac2b9f40d760c3e6e49f55b7e15abe:
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 642e44afbe6098f25022618be76c6f7b6b97df45:
I think I see the problem. The way one of the DistSender's replica loops is written, it's possible that, as a result of a ctx cancellation, we return the error we got when trying a previous replica instead of returning the ctx error, which is not good. I'll improve that. But, still, let's discuss the downstream effects, because I'm not sure it's a good idea for people to rely on a particular contract from the DistSender about the error returned in case of ctx cancellation.
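A rough, self-contained Go sketch of the failure mode described above (toy code, not the real DistSender): if the loop over replicas returns a stale per-replica error after cancellation instead of checking `ctx.Err()`, the caller sees a sendError-like error rather than `context.Canceled`; the version below prefers the ctx error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errSend is a toy stand-in for the DistSender's sendError.
var errSend = errors.New("sendError: no usable replica")

// tryReplicas loops over replica attempts. The important detail from the
// comment above: once the ctx is canceled, return ctx.Err() rather than the
// stale error left over from a previously tried replica.
func tryReplicas(ctx context.Context, attempts []func() error) error {
	var lastErr error
	for _, attempt := range attempts {
		if ctxErr := ctx.Err(); ctxErr != nil {
			return ctxErr // the buggy variant returned lastErr (a sendError) here
		}
		if err := attempt(); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	if ctxErr := ctx.Err(); ctxErr != nil {
		return ctxErr
	}
	return fmt.Errorf("%w (last attempt: %v)", errSend, lastErr)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	err := tryReplicas(ctx, []func() error{
		func() error { return errors.New("replica 1 unavailable") },
		func() error { cancel(); return errors.New("replica 2 unavailable") },
		func() error { return errors.New("replica 3 unavailable") },
	})
	fmt.Println(err) // prints "context canceled", not a sendError
}
```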
I've read the code, but can you distill these lines a bit more for me, please? It looks like the synchronizer is trying to figure out whether it's draining and, if it is, whether it should swallow an error. First of all, it seems to me that it'd be better if the synchronizer looked at its own state to see whether it's draining instead of inferring it from the errors, no? Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for …
Yes, that's essentially what we're trying to do.
Yeah, we could do that, but that wouldn't be sufficient for our purposes (note that we do assert that the synchronizer is in the draining state in …). The goal of that code is as follows: we have encountered an error, and the question is whether this error occurred precisely because we eagerly canceled the work when the synchronizer moved into the draining state. If that's the case, we want to swallow the error so that it doesn't propagate out. So checking that the state is, in fact, draining is a good point (I'll make that change), but we also want to make sure not to swallow errors that occurred legitimately (i.e. not because of the eager cancellation triggered in …).
I believe we are covered on this front. We use this eager context cancellation only for fully local plans (i.e. there are no inboxes, so we won't break any gRPC streams with it). Next, we propagate metadata differently in the vectorized engine (namely, all metadata is buffered and returned in …). Does this make sense? I know that this error/string matching is not great, and the contract that no internal component may convert the context cancellation into an error that doesn't contain "context canceled" when stringified is fragile, but it seems to work as is. Do you have any other suggestions for swallowing errors that occurred because of canceling the in-flight work? We will need something similar for the …
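To restate that decision as code, here is a self-contained toy sketch (the names are made up; the real synchronizer is structured differently): an input error is swallowed only if the synchronizer is already draining, i.e. it eagerly canceled the inputs' context itself, and the error is attributable to that cancellation, which is where the fragile "context canceled" string matching comes in.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

type syncState int

const (
	stateRunning syncState = iota
	stateDraining
)

// shouldSwallowErr mirrors the reasoning above: swallow an error from an input
// goroutine only if (1) we are already draining, meaning we eagerly canceled
// the inputs' context ourselves, and (2) the error looks like it came from
// that cancellation. Legitimate errors keep propagating to the client.
func shouldSwallowErr(state syncState, err error) bool {
	if state != stateDraining {
		return false // we haven't canceled anything; the error is real
	}
	// The fragile contract: internal components must not rewrap the
	// cancellation into something that no longer mentions it.
	return errors.Is(err, context.Canceled) ||
		strings.Contains(err.Error(), context.Canceled.Error())
}

func main() {
	fmt.Println(shouldSwallowErr(stateRunning, context.Canceled))                               // false
	fmt.Println(shouldSwallowErr(stateDraining, fmt.Errorf("rpc error: %w", context.Canceled))) // true
	fmt.Println(shouldSwallowErr(stateDraining, errors.New("unexpected EOF")))                  // false
}
```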
It makes some sense. I was thinking that some distinction between "fully local plans" and the more general case is probably part of the answer.
I wonder why exactly this is: why do we need a distinction between errors caused by the cancellation and other errors? It seems to me that, as long as we can get all the metadata, any error received after the synchronizer started draining can be ignored. And if everything is local, we can always get the metadata, right? This is all in theory, though. In practice, there is the TxnCoordSender behavior at cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go, lines 873 to 882 (b3a9b70).
So, I guess my question is: how does your swallowing of ctx cancelled errors in the synchronizer even work, given that the TxnCoordSender remembers the error state? Do we perhaps only do it when using a LeafTxn?
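For illustration, a toy model of the "remembers the error state" behavior being referenced (this is not the actual TxnCoordSender code; the type and method names are made up): once a request on the txn handle fails, the coordinator latches the error and rejects every later request, so swallowing the first error downstream would not make the txn usable again.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// txnCoordinator is a toy stand-in for a coordinator that is poisoned by the
// first error it sees: every subsequent request on the same handle fails.
type txnCoordinator struct {
	mu        sync.Mutex
	storedErr error
}

func (tc *txnCoordinator) send(do func() error) error {
	tc.mu.Lock()
	defer tc.mu.Unlock()
	if tc.storedErr != nil {
		return fmt.Errorf("txn already aborted: %w", tc.storedErr)
	}
	if err := do(); err != nil {
		tc.storedErr = err // latch the error for all future requests
		return err
	}
	return nil
}

func main() {
	tc := &txnCoordinator{}
	_ = tc.send(func() error { return errors.New("context canceled") })
	// Even if the caller swallowed the first error, the next request on the
	// same (root) txn handle still fails.
	fmt.Println(tc.send(func() error { return nil }))
}
```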
I was also thinking about this when I originally added this code, and I came up with the following scenario: imagine we have … Another scenario is when we have a "race" between the input goroutine doing work and the synchronizer transitioning into the draining state. Imagine that the input goroutine has performed some work and encountered an error that should be returned to the client, but before …
Yep, that's right. Now that we're discussing this, your suggestion of swallowing all errors once the synchronizer has transitioned into draining and canceled the inputs' context seems like it might actually be the safer choice: it would let us avoid requiring any invariants from the internal components about how they handle context cancellation errors. I guess the scenarios I mentioned above could go either way anyway.
We are using this eager-cancellation functionality only in a very limited set of circumstances (the query doesn't have any mutations, is executed locally, and a few other requirements), so we aren't concerned about writes. And we do use LeafTxns for the parallel branches.
In my opinion, for both scenarios you describe, swallowing errors is fine. When an error races with a drain, I think either outcome is fine. In the …
So we only use a synchronizer if we have parallelism, and if we have parallelism, we're using a LeafTxn for all parallel branches? We don't use the root txn on any of the branches? Because if we were, and it was the request on the root that was getting canceled, you wouldn't be able to swallow that error (unfortunately). Figuring out the root txn behavior after an error is a can of worms that we keep kicking down the road; the current story is not very good.
Yeah; if the root is not in play here, then this is what I'd do.
Yes, we will use only the LeafTxn (see cockroach/pkg/sql/distsql/server.go, lines 307 to 313 in db49ac0).
Thanks for the context; it makes sense. I think #69961 without any other changes is sufficient to fix this issue.
This patch changes the error returned by the DistSender on a cancelled ctx. Depending on exactly who detects the ctx as cancelled, there are a number of different possibilities - too many to enumerate here. Generally, the client is supposed to get a context.Canceled error, wrapped in different layers.

The code path that this patch changes is about the cancellation being detected by sendPartialBatch() without it having been previously detected by sendToReplicas(). This path is unusual (sendToReplicas() generally detects the ctx as cancelled), and it's also hard to test. Depending on the exact cancellation timing, though, it's possible for sendPartialBatch() to detect it instead. In this case, this patch makes it so that, if sendToReplicas() returned a sendError (indicating that a suitable replica could not be reached and that the higher layer is expected to continue trying other replicas), the error returned to the client is a cancellation error instead of the sendError.

Touches cockroachdb#69419

Release note: None
70159: kvclient: switch error in ctx cancel edge case r=andreimatei a=andreimatei

This patch changes the error returned by the DistSender on a cancelled ctx. Depending on exactly who detects the ctx as cancelled, there are a number of different possibilities - too many to enumerate here. Generally, the client is supposed to get a context.Canceled error, wrapped in different layers.

The code path that this patch changes is about the cancellation being detected by sendPartialBatch() without it having been previously detected by sendToReplicas(). This path is unusual (sendToReplicas() generally detects the ctx as cancelled). Depending on the exact cancellation timing, though, it's possible for sendPartialBatch() to detect it instead. In this case, this patch makes it so that, if sendToReplicas() returned a sendError (indicating that a suitable replica could not be reached and that the higher layer is expected to continue trying other replicas), the error returned to the client is a cancellation error instead of the sendError.

Touches #69419

Release note: None

75632: execstats: fix flakey TestTraceAnalyzer r=yuzefovich a=adityamaru

Fixes: #75546

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
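To show the intended behavior in isolation, here is a minimal Go sketch (the `sendError` type and `preferCtxError` function are simplified stand-ins, not the real kvcoord code): when the replica layer gives up with a sendError but the ctx turns out to be canceled, the cancellation is what gets reported to the client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sendError is a toy stand-in for the DistSender's internal "try other
// replicas" error; the real type lives in kvcoord and looks different.
type sendError struct{ cause error }

func (s *sendError) Error() string { return "sendError: " + s.cause.Error() }

// preferCtxError mirrors the idea in the commit message above: if the replica
// layer gave up with a sendError but the ctx was actually canceled, surface
// the cancellation so the client sees context.Canceled, not the sendError.
func preferCtxError(ctx context.Context, err error) error {
	var se *sendError
	if errors.As(err, &se) && ctx.Err() != nil {
		return fmt.Errorf("aborted during replica routing: %w", ctx.Err())
	}
	return err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := preferCtxError(ctx, &sendError{cause: errors.New("no usable replica")})
	fmt.Println(errors.Is(err, context.Canceled)) // true
}
```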
roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ ab1fc343c9a1140191f96353995258e609a84d02:
Reproduce
See: roachtest README