Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: acceptance/version-upgrade failed [ambiguous result] #65701

Closed
cockroach-teamcity opened this issue May 26, 2021 · 14 comments · Fixed by #69167
Closed

roachtest: acceptance/version-upgrade failed [ambiguous result] #65701

cockroach-teamcity opened this issue May 26, 2021 · 14 comments · Fixed by #69167
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team T-server-and-security DB Server & Security T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

roachtest.acceptance/version-upgrade failed with artifacts on master @ 4192f1d5b58257bfe796bc33f774b11a33ae39a7:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	versionupgrade.go:405,versionupgrade.go:200,versionupgrade.go:188,acceptance.go:47,acceptance.go:101,test_runner.go:792: pq: log-event: result is ambiguous (error=rpc error: code = Unavailable desc = transport is closing [exhausted])
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh acceptance/version-upgrade

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 26, 2021
@tbg
Copy link
Member

tbg commented May 26, 2021

Ambiguous result returned from here:

rows, err := ex.Exec(ctx, "log-event", txn, query, args...)

I wonder how this can happen, we're not committing the txn here, how can this return an ambiguous result? Looks like this came from DistSender somewhere.

@erikgrinaker
Copy link
Contributor

I wonder how this can happen, we're not committing the txn here, how can this return an ambiguous result? Looks like this came from DistSender somewhere.

I think any write operation could result in an ambiguous result. For example, if a context is cancelled during Raft execution, we always return an ambiguous result.

@tbg
Copy link
Member

tbg commented May 26, 2021

We shouldn't have to though, no? An ambiguous result is always recoverable unless it's part of a commit attempt. That being said... any intent write can now be part of a commit attempt (thx parallel commits). So we need to return them at that level, but we can terminate them higher up (turning them into a txn retry) unless we were actually trying to commit.

That said, I'm not sure the error here comes from Raft. I thought it looked like this one came from DistSender, and there we are trying to avoid them whenever possible. I might be wrong.

@erikgrinaker
Copy link
Contributor

We shouldn't have to though, no? An ambiguous result is always recoverable unless it's part of a commit attempt.

It's possible we take efforts to recover higher up, not familiar with this yet.

That said, I'm not sure the error here comes from Raft. I thought it looked like this one came from DistSender, and there we are trying to avoid them whenever possible.

I wasn't implying that this was necessarily from Raft. Just extrapolating that we seem to return ambiguous results from certain write errors, particularly context cancellation, so wouldn't be surprised to see that elsewhere. But I don't have all the context here (heh).

tbg added a commit to tbg/cockroach that referenced this issue May 26, 2021
This error type held on to a string message, which is unsuitable for
redaction. In practice, this error opaquely wraps an error, so make this
transparent, which we can now do thanks to `cockroachdb/errors`.

It turned out that the proto already had an unused field that was (once,
but no longer) used in testing, so we can use that in production now.

I chose to print the root error's file:line in the error as we typically
don't print errors via `%+v` which is otherwise how these would be
exposed. I expect that to be mildly useful to diagnose problems such as
\cockroachdb#65701.

I think there is room for a more general mechanism that exposes the root
cause's location in general, though I am unsure how to best introduce it
given that we usually observe errors through SQL, and on many possible
paths especially for rare flakes such as the above, so no steps towards
that are taken at this point.

Release note: None
@tbg
Copy link
Member

tbg commented May 26, 2021

This particular ambiguous result comes from here:

https://github.com/cockroachdb/cockroach/blob/f45ec2c959476595653e5c5242d9f3f69700a33e/pkg/kv/kvclient/kvcoord/dist_sender.go#L2099-L2101

(this is the only case with the [exhausted] we're seeing here). This must then have been called from

https://github.com/cockroachdb/cockroach/blob/f45ec2c959476595653e5c5242d9f3f69700a33e/pkg/kv/kvclient/kvcoord/dist_sender.go#L1825

and the only write to ambiguousError is here:

https://github.com/cockroachdb/cockroach/blob/f45ec2c959476595653e5c5242d9f3f69700a33e/pkg/kv/kvclient/kvcoord/dist_sender.go#L1919-L1921

As I expected, this implies withCommit, which.. wasn't true here? The error comes from here

rows, err := ex.Exec(ctx, "log-event", txn, query, args...)

and it's certainly not committing txn here.

I also think that if the KV layer returned an ambiguous result, this wouldn't hit this code path (I'm fairly sure... unless there is some weird conversion going on somewhere; but also the message wrapped in the ambiguous error seems pretty clearly distsender-side). So it does come from the distsender snippet above. But why would withCommit be true? It's set here and should clearly be false:

https://github.com/cockroachdb/cockroach/blob/f45ec2c959476595653e5c5242d9f3f69700a33e/pkg/kv/kvclient/kvcoord/dist_sender.go#L744-L748

@tbg
Copy link
Member

tbg commented May 26, 2021

I suspected that perhaps p.txn (which is used for this txn) is perhaps nil and that we'd fall into some auto-commit path, but no, can not find any evidence for that.

As an aside, all nodes were running master when the failure got emitted. This shouldn't really matter as I don't think there have been any changes from 21.1 in this area.

@tbg tbg changed the title roachtest: acceptance/version-upgrade failed roachtest: acceptance/version-upgrade failed [ambiguous result May 31, 2021
@tbg tbg changed the title roachtest: acceptance/version-upgrade failed [ambiguous result roachtest: acceptance/version-upgrade failed [ambiguous result] May 31, 2021
tbg added a commit to tbg/cockroach that referenced this issue May 31, 2021
This error type held on to a string message, which is unsuitable for
redaction. In practice, this error opaquely wraps an error, so make this
transparent, which we can now do thanks to `cockroachdb/errors`.

It turned out that the proto already had an unused field that was (once,
but no longer) used in testing, so we can use that in production now.

I chose to print the root error's file:line in the error as we typically
don't print errors via `%+v` which is otherwise how these would be
exposed. I expect that to be mildly useful to diagnose problems such as
\cockroachdb#65701.

I think there is room for a more general mechanism that exposes the root
cause's location in general, though I am unsure how to best introduce it
given that we usually observe errors through SQL, and on many possible
paths especially for rare flakes such as the above, so no steps towards
that are taken at this point.

Release note: None
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@cockroach-teamcity
Copy link
Member Author

roachtest.acceptance/version-upgrade failed with artifacts on master @ 8758904a4cb48d3308e8c2be4289ec192fabaa13:

The test failed on branch=master, cloud=local:
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	versionupgrade.go:343,versionupgrade.go:532,versionupgrade.go:514,versionupgrade.go:213,versionupgrade.go:201,acceptance.go:51,acceptance.go:104,test_runner.go:765: pq: operation "show cluster setting version" timed out after 2m0s: value differs between local setting ([18 8 8 21 16 1 24 0 32 112]) and KV ([18 8 8 21 16 1 24 0 32 0]); try again later (<nil> after 1m59.130466935s)
Reproduce

To reproduce, try:

## Simple repro (linux-only):
  $ make cockroachshort bin/worklaod bin/roachprod bin/roachtest
  $ PATH=$PWD/bin:$PATH roachtest run acceptance/version-upgrade --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.acceptance/version-upgrade failed with artifacts on master @ 8d0f75eb31c16bddf9f8c5be2c4fc08a087f58d6:

The test failed on branch=master, cloud=local:
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	versionupgrade.go:318,versionupgrade.go:130,versionupgrade.go:213,versionupgrade.go:201,acceptance.go:55,acceptance.go:108,test_runner.go:777: dial tcp 127.0.0.1:26257: connect: connection refused

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 1: dead (exit status unknown)
		4: 11267
		3: 11519
		2: 11393
		Error: UNCLASSIFIED_PROBLEM: 1: dead (exit status unknown)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 1: dead (exit status unknown)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in: * Changes / Build branch: master * Parameters / `env.TESTS`: `^acceptance/version-upgrade$` * Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@tbg
Copy link
Member

tbg commented Aug 20, 2021

cockroach server exited with error: ensuring SQL migrations: failed to run migration "add indexes on columns revokedAt and lastUsedAt for system.web_sessions table": add-web-sessions-revoked-at-idx: log-event: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: EOF [exhausted])

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 20, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 25, 2021
@tbg
Copy link
Member

tbg commented Aug 25, 2021

Looks like this migration (and these migrations in general) could retry a little bit more aggressively, especially on ambiguous results, which have to be handled (and aren't). Based on my reading of the code, they don't at all.
It's unclear where the "error reading from server" comes from, though.

@knz
Copy link
Contributor

knz commented Aug 25, 2021

This las migration will be replaced by a long running migration instead (by @cameronnunez ). Cameron please link this issue from your PR too.

@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Aug 31, 2021
@craig craig bot closed this as completed in #69167 Sep 2, 2021
@cameronnunez cameronnunez reopened this Sep 2, 2021
@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@knz
Copy link
Contributor

knz commented Sep 13, 2021

Closing as this hasn't failed since Cameron's PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team T-server-and-security DB Server & Security T-sql-queries SQL Queries Team
Projects
None yet
6 participants