-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: failing to upload initial binaries or libraries causes entire test suite to fail #109279
Comments
renatolabs
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-testing
Testing tools and infrastructure
T-testeng
TestEng Team
labels
Aug 22, 2023
cc @cockroachdb/test-eng |
renatolabs
added a commit
to renatolabs/cockroach
that referenced
this issue
Aug 24, 2023
roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons: * Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center; and partly because of issues in roachtest itself (see cockroachdb#104029). * Exiting with code 11 causes the build to be marked and reported as a failrue on TeamCity/Slack and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see cockroachdb#109279). For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests. Epic: none Release note: None
craig bot
pushed a commit
that referenced
this issue
Aug 25, 2023
109463: build/roachtest: do not exit with code 11 on cluster creation failure r=srosenberg,herkolategan a=renatolabs roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons: * Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center; and partly because of issues in roachtest itself (see #104029). * Exiting with code 11 causes the build to be marked and reported as a failrue on TeamCity/Slack and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see #109279). For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests. Epic: none Release note: None Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Aug 25, 2023
roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons: * Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center; and partly because of issues in roachtest itself (see #104029). * Exiting with code 11 causes the build to be marked and reported as a failrue on TeamCity/Slack and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see #109279). For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests. Epic: none Release note: None
renatolabs
added a commit
to renatolabs/cockroach
that referenced
this issue
Aug 31, 2023
After a cluster is created (or reused) and is ready to run a test, the test runner will upload initial files (such as a cockroach binary and required libraries) to all nodes. When this process fails (typically because of SSH flakes), the worker returns an error, causing the entire test suite to terminate early before all tests had a chance to run. This is not ideal since these SSH flakes are intermittent and should not cause the test runner to stop. In this commit, we unify the handling of these errors with handling creation failures: the error is marked as a "cluster creation error" flake, causing it to be reported as such. Crucially, the roachtest worker will not return an error and other tests will continue to run. Fixes: cockroachdb#109279 Release note: None
craig bot
pushed a commit
that referenced
this issue
Sep 1, 2023
109809: rangefeed: deflake TestBudgetReleaseOnProcessorStop r=erikgrinaker a=pavelkalinnikov Fixes #109802 Epic: none Release note: none 109851: roachtest: treat cluster setup errros the same as creation errors r=srosenberg a=renatolabs After a cluster is created (or reused) and is ready to run a test, the test runner will upload initial files (such as a cockroach binary and required libraries) to all nodes. When this process fails (typically because of SSH flakes), the worker returns an error, causing the entire test suite to terminate early before all tests had a chance to run. This is not ideal since these SSH flakes are intermittent and should not cause the test runner to stop. In this commit, we unify the handling of these errors with handling creation failures: the error is marked as a "cluster creation error" flake, causing it to be reported as such. Crucially, the roachtest worker will not return an error and other tests will continue to run. Fixes: #109279 Release note: None Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Sep 1, 2023
After a cluster is created (or reused) and is ready to run a test, the test runner will upload initial files (such as a cockroach binary and required libraries) to all nodes. When this process fails (typically because of SSH flakes), the worker returns an error, causing the entire test suite to terminate early before all tests had a chance to run. This is not ideal since these SSH flakes are intermittent and should not cause the test runner to stop. In this commit, we unify the handling of these errors with handling creation failures: the error is marked as a "cluster creation error" flake, causing it to be reported as such. Crucially, the roachtest worker will not return an error and other tests will continue to run. Fixes: #109279 Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In a recent nightly build [1], we observed that the test suite terminated abruptly before all tests had a chance to run. The reason for this is that one of the workers failed to upload the default
cockroach
binary to the cluster that was going to be used in a test run:This is not ideal for at least a couple of reasons:
cluster_creation
errors.Related to the second point above, a possible way to fix this issue would be to consider errors uploading initial files [2] as
clusterCreateErr
.[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=11429784&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel
[2]
cockroach/pkg/cmd/roachtest/test_runner.go
Lines 755 to 760 in 236f70a
Jira issue: CRDB-30854
The text was updated successfully, but these errors were encountered: