-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: acceptance/version-upgrade failed [ambiguous result] #65701
Comments
Ambiguous result returned from here: cockroach/pkg/sql/event_log.go Line 515 in 5d96260
I wonder how this can happen, we're not committing the txn here, how can this return an ambiguous result? Looks like this came from DistSender somewhere. |
I think any write operation could result in an ambiguous result. For example, if a context is cancelled during Raft execution, we always return an ambiguous result. |
We shouldn't have to though, no? An ambiguous result is always recoverable unless it's part of a commit attempt. That being said... any intent write can now be part of a commit attempt (thx parallel commits). So we need to return them at that level, but we can terminate them higher up (turning them into a txn retry) unless we were actually trying to commit. That said, I'm not sure the error here comes from Raft. I thought it looked like this one came from DistSender, and there we are trying to avoid them whenever possible. I might be wrong. |
It's possible we take efforts to recover higher up, not familiar with this yet.
I wasn't implying that this was necessarily from Raft. Just extrapolating that we seem to return ambiguous results from certain write errors, particularly context cancellation, so wouldn't be surprised to see that elsewhere. But I don't have all the context here (heh). |
This error type held on to a string message, which is unsuitable for redaction. In practice, this error opaquely wraps an error, so make this transparent, which we can now do thanks to `cockroachdb/errors`. It turned out that the proto already had an unused field that was (once, but no longer) used in testing, so we can use that in production now. I chose to print the root error's file:line in the error as we typically don't print errors via `%+v` which is otherwise how these would be exposed. I expect that to be mildly useful to diagnose problems such as \cockroachdb#65701. I think there is room for a more general mechanism that exposes the root cause's location in general, though I am unsure how to best introduce it given that we usually observe errors through SQL, and on many possible paths especially for rare flakes such as the above, so no steps towards that are taken at this point. Release note: None
This particular ambiguous result comes from here: (this is the only case with the and the only write to As I expected, this implies cockroach/pkg/sql/event_log.go Line 515 in 5d96260
and it's certainly not committing I also think that if the KV layer returned an ambiguous result, this wouldn't hit this code path (I'm fairly sure... unless there is some weird conversion going on somewhere; but also the message wrapped in the ambiguous error seems pretty clearly distsender-side). So it does come from the distsender snippet above. But why would |
I suspected that perhaps As an aside, all nodes were running |
This error type held on to a string message, which is unsuitable for redaction. In practice, this error opaquely wraps an error, so make this transparent, which we can now do thanks to `cockroachdb/errors`. It turned out that the proto already had an unused field that was (once, but no longer) used in testing, so we can use that in production now. I chose to print the root error's file:line in the error as we typically don't print errors via `%+v` which is otherwise how these would be exposed. I expect that to be mildly useful to diagnose problems such as \cockroachdb#65701. I think there is room for a more general mechanism that exposes the root cause's location in general, though I am unsure how to best introduce it given that we usually observe errors through SQL, and on many possible paths especially for rare flakes such as the above, so no steps towards that are taken at this point. Release note: None
roachtest.acceptance/version-upgrade failed with artifacts on master @ 8758904a4cb48d3308e8c2be4289ec192fabaa13:
Reproduce
To reproduce, try: ## Simple repro (linux-only):
$ make cockroachshort bin/worklaod bin/roachprod bin/roachtest
$ PATH=$PWD/bin:$PATH roachtest run acceptance/version-upgrade --local
## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md. |
roachtest.acceptance/version-upgrade failed with artifacts on master @ 8d0f75eb31c16bddf9f8c5be2c4fc08a087f58d6:
Reproduce
See: roachtest README See: CI job to stress roachtests For the CI stress job, click the ellipsis (...) next to the Run button and fill in: * Changes / Build branch: master * Parameters / `env.TESTS`: `^acceptance/version-upgrade$` * Parameters / `env.COUNT`: <number of runs> |
|
Looks like this migration (and these migrations in general) could retry a little bit more aggressively, especially on ambiguous results, which have to be handled (and aren't). Based on my reading of the code, they don't at all. |
This las migration will be replaced by a long running migration instead (by @cameronnunez ). Cameron please link this issue from your PR too. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Closing as this hasn't failed since Cameron's PR. |
roachtest.acceptance/version-upgrade failed with artifacts on master @ 4192f1d5b58257bfe796bc33f774b11a33ae39a7:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly. caffeinate ./roachstress.sh acceptance/version-upgrade
This test on roachdash | Improve this report!
The text was updated successfully, but these errors were encountered: