Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: failing to upload initial binaries or libraries causes entire test suite to fail #109279

Closed
renatolabs opened this issue Aug 22, 2023 · 1 comment · Fixed by #109851
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Aug 22, 2023

In a recent nightly build [1], we observed that the test suite terminated abruptly before all tests had a chance to run. The reason for this is that one of the workers failed to upload the default cockroach binary to the cluster that was going to be used in a test run:

18594244 [16:10:02] : [Step 2/3] Worker 1 returned with error. Quiescing. Error: cluster.PutE: put /go/src/github.com/cockroachdb/cockroach/bin/cockroach.linux-amd64 failed: ~ scp -r -C -o StrictHostKeyChecking=no -o ConnectTimeout=10 -i /home/roach/.ssh/id_rsa -i /home/roach/18594244 .ssh/google_compute_engine [email protected]:./cockroach-default [email protected]:./cockroach-default
18594246 [16:10:02] : [Step 2/3] ssh: connect to host 104.196.185.100 port 22: Connection timed out
18594247 [16:10:02] : [Step 2/3] : exit status 1

This is not ideal for at least a couple of reasons:

  1. When the build fails, it's not immediately clear why it failed. One has to look through really long log files (in the case of [1], a whopping 2.6GB log file) to find the underlying error. In addition, when a worker fails, it waits for all other workers to finish running the test they are currently running (which could take hours); this means that the error that caused the roachtest process to exit might be several thousand lines before the process actually quits.
  2. Failing to upload a file to a cluster shouldn't cause the entire TC build to halt. We should treat these failures just like we do cluster_creation errors.

Related to the second point above, a possible way to fix this issue would be to consider errors uploading initial files [2] as clusterCreateErr.

[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=11429784&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel
[2]

if c.spec.NodeCount > 0 { // skip during tests
err = c.PutDefaultCockroach(ctx, l, t.Cockroach())
}
if err == nil {
err = c.PutLibraries(ctx, "./lib", t.spec.NativeLibs)
}

Jira issue: CRDB-30854

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Aug 22, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 22, 2023

cc @cockroachdb/test-eng

renatolabs added a commit to renatolabs/cockroach that referenced this issue Aug 24, 2023
roachtest will exit with code 11 if creating any clusters during a
test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary
unavailability of a resource type in a data center; and partly because
of issues in roachtest itself (see cockroachdb#104029).
* Exiting with code 11 causes the build to be marked and reported as a
failrue on TeamCity/Slack and that's disruptive. We already get
cluster creation failure notifications on GitHub. By reporting them as
build failures on TeamCity, we mask actually serious issues like the
test runner crashing in the middle of the build and not running every
test (for a recent example, see cockroachdb#109279).

For these reasons, this commit updates the script used by TeamCity to
invoke roachtest to also ignore exit code 11 (just like it currently
does for exit code 10). This makes roachtest build failures stand out
more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Aug 25, 2023
109463: build/roachtest: do not exit with code 11 on cluster creation failure r=srosenberg,herkolategan a=renatolabs

roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center; and partly because of issues in roachtest itself (see #104029).
* Exiting with code 11 causes the build to be marked and reported as a failrue on TeamCity/Slack and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see #109279).

For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None

Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Aug 25, 2023
roachtest will exit with code 11 if creating any clusters during a
test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary
unavailability of a resource type in a data center; and partly because
of issues in roachtest itself (see #104029).
* Exiting with code 11 causes the build to be marked and reported as a
failrue on TeamCity/Slack and that's disruptive. We already get
cluster creation failure notifications on GitHub. By reporting them as
build failures on TeamCity, we mask actually serious issues like the
test runner crashing in the middle of the build and not running every
test (for a recent example, see #109279).

For these reasons, this commit updates the script used by TeamCity to
invoke roachtest to also ignore exit code 11 (just like it currently
does for exit code 10). This makes roachtest build failures stand out
more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Aug 31, 2023
After a cluster is created (or reused) and is ready to run a test, the
test runner will upload initial files (such as a cockroach binary and
required libraries) to all nodes. When this process fails (typically
because of SSH flakes), the worker returns an error, causing the
entire test suite to terminate early before all tests had a chance to
run.

This is not ideal since these SSH flakes are intermittent and should
not cause the test runner to stop. In this commit, we unify the
handling of these errors with handling creation failures: the error is
marked as a "cluster creation error" flake, causing it to be reported
as such. Crucially, the roachtest worker will not return an error and
other tests will continue to run.

Fixes: cockroachdb#109279

Release note: None
craig bot pushed a commit that referenced this issue Sep 1, 2023
109809: rangefeed: deflake TestBudgetReleaseOnProcessorStop r=erikgrinaker a=pavelkalinnikov

Fixes #109802
Epic: none
Release note: none

109851: roachtest: treat cluster setup errros the same as creation errors r=srosenberg a=renatolabs

After a cluster is created (or reused) and is ready to run a test, the test runner will upload initial files (such as a cockroach binary and required libraries) to all nodes. When this process fails (typically because of SSH flakes), the worker returns an error, causing the entire test suite to terminate early before all tests had a chance to run.

This is not ideal since these SSH flakes are intermittent and should not cause the test runner to stop. In this commit, we unify the handling of these errors with handling creation failures: the error is marked as a "cluster creation error" flake, causing it to be reported as such. Crucially, the roachtest worker will not return an error and other tests will continue to run.

Fixes: #109279

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
@craig craig bot closed this as completed in 35e7946 Sep 1, 2023
blathers-crl bot pushed a commit that referenced this issue Sep 1, 2023
After a cluster is created (or reused) and is ready to run a test, the
test runner will upload initial files (such as a cockroach binary and
required libraries) to all nodes. When this process fails (typically
because of SSH flakes), the worker returns an error, causing the
entire test suite to terminate early before all tests had a chance to
run.

This is not ideal since these SSH flakes are intermittent and should
not cause the test runner to stop. In this commit, we unify the
handling of these errors with handling creation failures: the error is
marked as a "cluster creation error" flake, causing it to be reported
as such. Crucially, the roachtest worker will not return an error and
other tests will continue to run.

Fixes: #109279

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant