Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod,roachtest: backport recent improvements #105831

Conversation

renatolabs
Copy link
Contributor

@renatolabs renatolabs commented Jun 29, 2023

This backports more recent improvements to the roachtest/roachprod infrastructure, specifically:

This will make it much easier to backport other changes that build on top of these improvements.

Epic: None

Release notes: None

Release justification: test-only changes.

@renatolabs renatolabs requested a review from a team as a code owner June 29, 2023 17:11
@renatolabs renatolabs requested review from srosenberg and smg260 and removed request for a team June 29, 2023 17:11
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Miral Gadani and others added 6 commits June 29, 2023 14:44
…es fails.

Previously, when encountering an error on a node, we would wait until
the command finished on all the specified nodes until returning to the
caller. This meant long-running workloads would continue until completion
despite having already failed on at least one of the nodes.

The context is now threaded through the call stack, allowing the
c.Parallel to issue a cancellation for early termination.

Resolves: cockroachdb#98520
Release note: none
Epic: none
as we no longer wait for all the nodes to complete in the event of an
error on any node.
however sometimes it is desirable for the command to complete running on
all nodes despite failures.

e.g. `roachprod run <cluster> 'ls /mnt/data1/cockroach'` or when
collecting dmesg logs after a test.

This commit has also switched to functional options for c.Parallel for
more flexibility and allow calls to specify WithWaitOnFail().

Epic: none
Fixes: cockroachdb#101150

Release note: None
Previously, we added fail fast behaviour when executing commands across
nodes using roachtest/roachprod. There are some instances, specifically
from the CLI, that should wait for all results to be returned. This PR
adds `WithWaitOnFail()` to `ExecSQL()` in roachprod.

Epic: none
Fixes: cockroachdb#104342

Release note: None
Starting a cluster locally does a check for certificates. In the event the
certificates are not found, which is a valid case, an error is printed. The
start command works correctly, but the error can cause confusion:

```bash
/bin/roachprod start local --secure 12:47:56 cluster_synced.go:2475: 0:
COMMAND_PROBLEM: exit status 1 (1) COMMAND_PROBLEM Wraps: (2) exit status 1
Error types: (1) errors.Cmd (2) *exec.ExitError: local: initializing certs 1/1 /
local: distributing certs 2/2
```

This change modifies the command to do a check without causing an error, and
also propagates any real errors that could occur while doing the check.
For some reason, the current form of `fileExistsOnFirstNode`
can return `found=true` when it should return `found=false`.

This can be reproduced by running the `multitenant-upgrade`
roachtest and seeing it hang at:
```
multitenant_upgrade.go:154: test status: checking the pre-upgrade sql server still works after the system tenant binary upgrade
```
because of a `TLS handshake error`.

Note: the test is also broken because of another issue
so with this fix it should now fail with:
```
(assertions.go:333).Fail:
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:390
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:189
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:38
	            				main/pkg/cmd/roachtest/test_runner.go:1060
	            				GOROOT/src/runtime/asm_arm64.s:1172
	Error:      	Not equal:
	            	expected: [][]string{[]string{"23.1"}}
	            	actual  : [][]string{[]string{"22.2"}}

	            	Diff:
	            	--- Expected
	            	+++ Actual
	            	@@ -2,3 +2,3 @@
	            	  ([]string) (len=1) {
	            	-  (string) (len=4) "23.1"
	            	+  (string) (len=4) "22.2"
	            	  }
	Test:       	multitenant-upgrade
```

Release note: None
Epic: none
@renatolabs renatolabs force-pushed the rc/backport-100403-101222-104343-104777-105514 branch from e93daf5 to 3a12d96 Compare June 29, 2023 18:45
@srosenberg srosenberg requested a review from herkolategan June 29, 2023 18:50
@renatolabs
Copy link
Contributor Author

Nightly build with SELECT_PROBABILITY=0.5. No failures so far:

https://teamcity.cockroachdb.com/viewLog.html?buildId=10733019&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildResultsDiv

@renatolabs
Copy link
Contributor Author

Failures on the build above also happen on master and are unrelated to the changes introduced here. Merging. TFTR!

@renatolabs renatolabs merged commit a9d0bbe into cockroachdb:release-23.1 Jul 4, 2023
@renatolabs renatolabs deleted the rc/backport-100403-101222-104343-104777-105514 branch July 4, 2023 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants