roachtest: ClusterSpec should support user-specified clouds _compatible_ with a test #104029
Comments
cc @cockroachdb/test-eng |
Tagging (pun intended) relevant issues,
Based on an earlier discussion within TestEng, tags shouldn't be used for capturing (compatibility) semantics; they are merely metadata for labeling/organizing (e.g., grouping) roachtests. |
roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center, and partly because of issues in roachtest itself (see cockroachdb#104029).
* Exiting with code 11 causes the build to be marked and reported as a failure on TeamCity/Slack, and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see cockroachdb#109279).

For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests.

Epic: none
Release note: None
109463: build/roachtest: do not exit with code 11 on cluster creation failure r=srosenberg,herkolategan a=renatolabs

Co-authored-by: Renato Costa <[email protected]>
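The change itself lives in the script TeamCity uses to invoke roachtest, which is not shown here. As a rough illustration of the decision it describes, the sketch below treats exit codes 10 and 11 as non-fatal for the build. The function and constant names are hypothetical; only the two numeric codes come from the commit message.

```go
package main

import "fmt"

// Exit codes the wrapper tolerates. The numeric values are taken from the
// commit message; the constant names are hypothetical.
const (
	exitCodeAlreadyIgnored  = 10 // already ignored by the script before this change
	exitCodeClusterCreation = 11 // at least one cluster could not be created
)

// buildShouldFail reports whether a roachtest exit code should mark the
// CI build as failed. Codes 10 and 11 are reported elsewhere (for example
// via GitHub notifications), so they do not fail the build.
func buildShouldFail(exitCode int) bool {
	switch exitCode {
	case 0, exitCodeAlreadyIgnored, exitCodeClusterCreation:
		return false
	default:
		return true
	}
}

func main() {
	for _, code := range []int{0, 1, 10, 11} {
		fmt.Printf("exit code %d -> fail build: %t\n", code, buildShouldFail(code))
	}
}
```

With this policy, any remaining build failure points at roachtest being unable to run all tests, which is the signal the commit wants to preserve.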
I think that if there is a separate "cloud compatibility" field in the test spec (as proposed in #100605), the The "cloud compatibility" field could also live inside |
Logically, it makes sense. However, we may need to refactor |
Another super confusing thing is that the registry "leaks" the cloud that we are using through the cockroach/pkg/cmd/roachtest/tests/tpcc.go Line 538 in c34dd5f
cockroach/pkg/cmd/roachtest/tests/tpcc.go Line 862 in c34dd5f
This is crazy; test registration should not depend on the flags. If we need different parameters, we should simply define two different tests (or have the test decide on the parameter once it's running). |
Yes. As a first step, we should make sure none of the code registering tests looks at the |
This commit cleans up the tpcc code to not look at the cloud (leaked through `TestSpec`) during registration. Instead, we define both GCE and AWS values in the spec and decide between them when the test is run. Informs cockroachdb#104029 Release note: None
111285: roachtest: tpcc: don't look at cloud during registration r=RaduBerinde a=RaduBerinde

Co-authored-by: Radu Berinde <[email protected]>
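A minimal sketch of the pattern this commit describes: the spec carries a value for each cloud at registration time, and the choice is made only once the test is running and the actual cloud is known. The type, field, and function names are hypothetical, and the numbers are placeholders, not values from `tpcc.go`.

```go
package main

import "fmt"

// Cloud names a cloud provider. Hypothetical type for this sketch.
type Cloud string

const (
	GCE Cloud = "gce"
	AWS Cloud = "aws"
)

// benchSpec is a hypothetical stand-in for a benchmark spec. It stores a
// value per cloud, so registration never needs to know which cloud is used.
type benchSpec struct {
	Name            string
	EstimatedMaxGCE int
	EstimatedMaxAWS int
}

// estimatedMax picks the value for the cloud the test is running on.
// It is meant to be called at run time, not during registration.
func (s benchSpec) estimatedMax(c Cloud) (int, error) {
	switch c {
	case GCE:
		return s.EstimatedMaxGCE, nil
	case AWS:
		return s.EstimatedMaxAWS, nil
	default:
		return 0, fmt.Errorf("%s: no estimate for cloud %q", s.Name, c)
	}
}

func main() {
	// Placeholder numbers, for illustration only.
	spec := benchSpec{Name: "example-bench", EstimatedMaxGCE: 100, EstimatedMaxAWS: 200}
	for _, c := range []Cloud{GCE, AWS, "other"} {
		v, err := spec.estimatedMax(c)
		fmt.Println(c, v, err)
	}
}
```

Returning an error for an unknown cloud keeps the failure at run time, where the cloud is actually known, instead of hiding it behind a default.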
This change removes all remaining uses of `ClusterSpec.Cloud` except those internal to roachtest. Code that is part of running a test now uses `Cluster.Cloud()` instead. Informs: cockroachdb#104029 Release note: None
111811: spec: move machine type, zone, local ssd defaults out of the TestSpec and the registry r=RaduBerinde a=RaduBerinde

This set of commits makes more progress towards #104029 and - more generally - not baking in any flag configuration into the registry itself.

#### spec: move RoachprodOpts args to separate struct

Epic: none
Release note: None

#### spec: move default machine type from ClusterSpec to RoachprodClusterConfig

This is much more logical and allows the removal of the parameter from the registry.

Epic: none
Release note: None

#### spec: move default zones to RoachprodClusterConfig

Remove the default zones from `ClusterSpec` (and the registry), and add it to `RoachprodClusterConfig`.

Epic: none
Release note: None

#### spec: move local SSD preference to RoachprodClusterConfig

We change the boolean in the TestSpec to a tri-state (default, prefer on, disable). This way we can apply the default when creating the cluster.

Epic: none
Release note: None

Co-authored-by: Radu Berinde <[email protected]>
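The tri-state local SSD preference from the last commit can be sketched as follows. The three states (default, prefer on, disable) come from the commit message; the type, constant, and function names are hypothetical and need not match the real `spec` package.

```go
package main

import "fmt"

// LocalSSDSetting is a hypothetical tri-state replacing a plain boolean.
// The zero value means the test expressed no preference.
type LocalSSDSetting int

const (
	LocalSSDDefault  LocalSSDSetting = iota // defer to the cluster-level default
	LocalSSDPreferOn                        // test prefers local SSDs
	LocalSSDDisable                         // test must not use local SSDs
)

// resolveLocalSSD combines the per-test setting with the default that is
// known only when the cluster is created (for example, from flags).
func resolveLocalSSD(s LocalSSDSetting, clusterDefault bool) bool {
	switch s {
	case LocalSSDPreferOn:
		return true
	case LocalSSDDisable:
		return false
	default:
		return clusterDefault
	}
}

func main() {
	for _, def := range []bool{false, true} {
		fmt.Printf("cluster default %t: default=%t preferOn=%t disable=%t\n",
			def,
			resolveLocalSSD(LocalSSDDefault, def),
			resolveLocalSSD(LocalSSDPreferOn, def),
			resolveLocalSSD(LocalSSDDisable, def))
	}
}
```

A boolean cannot distinguish "the test did not say" from "the test said no", which is why the default previously had to be baked into the spec at registration time.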
This change is the last step in removing runtime state from the test registry and the cluster spec. The cloud is no longer accessible at test registration time. Fixes cockroachdb#104029 Release note: None
113301: roachtest: allow tests to specify a cockroach binary to use r=renatolabs,RaduBerinde a=DarrylWong

Currently, roachtests must manually upload their own cockroach binary if needed through the Put API. However, almost all roachtests upload the standard t.Cockroach() binary to ./cockroach on all nodes, resulting in the same Put code being duplicated at the start of most tests. Additionally, to collect artifacts we still need a cockroach binary at a discoverable path, leading to the same binary being copied twice in many cases, see: #97814

This change adds a TestSpec option which lets tests specify a cockroach binary to use. If one is not specified, we now upload the t.Cockroach() binary to ./cockroach. This lets us remove cockroach-default logic for artifacts, and removes the need to manually upload binaries at the start of each test.

Release note: None
Fixes: #104729

113505: roachtest: remove cloud from registry and ClusterSpec r=RaduBerinde a=RaduBerinde

This change is the last step in removing runtime state from the test registry and the cluster spec. The cloud is no longer accessible at test registration time.

Fixes #104029
Release note: None

Co-authored-by: DarrylWong <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Currently, roachtests don't have an established mechanism for specifying a set of cloud providers which are compatible with a given test. Theoretically, a roachtest should be cloud-agnostic since it doesn't directly interact with cloud APIs, a task that's delegated to roachprod. In practice, several roachtests may in fact be incompatible with a set of cloud providers. E.g.,

* `schemachange/mixed-versions-compat` uses `gsutil` to copy corpus data from a GCS bucket [1]
* `restore` variant uses AWS-specific zones [2]
* `clearrange` and `ycsb/A` use ZFS, available only in GCE [3], [4], [5]

Note, while `gsutil` in the first example may seem like a superficial incompatibility, in reality large-scale backup/restore tests may induce large egress if data is pulled from a cloud provider different from the one where the test is scheduled to execute. Hence, we need to ensure that either the data is sufficiently replicated (i.e., local to the test's cloud provider), or the test is specified to be incompatible with the cloud providers which lack the required test data (fixtures).

In practice, there may be additional, albeit rare, reasons for incompatibility; e.g., quota, price, specific machine type, etc.

Consequently, there must be an established mechanism, both for specifying when a test is incompatible with a cloud provider, and for skipping the test from executing against all incompatible cloud providers. Currently, `ClusterSpec.Cloud` denotes the cloud provider that's been provided via `roachtest run --cloud`, not a compatible cloud provider specified by the test author. That is, the framework makes an implicit (and wrong) assumption that every test should be executable against `ClusterSpec.Cloud`. As a further confounding factor, CI uses `tags` to select roachtests per given cloud provider [6].

As for skipping incompatible tests, test authors came up with ad hoc workarounds, e.g., [7], [8], thereby complicating both the setup logic and future refactoring.
[1] cockroach/pkg/cmd/roachtest/tests/mixed_version_decl_schemachange_compat.go, Lines 67 to 70 in 87c6775
[2] cockroach/pkg/cmd/roachtest/tests/restore.go, Lines 299 to 304 in 87c6775
[3] cockroach/pkg/cmd/roachtest/tests/clearrange.go, Line 56 in 87c6775
[4] cockroach/pkg/cmd/roachtest/tests/ycsb.go, Line 118 in 87c6775
[5] cockroach/pkg/cmd/roachtest/spec/cluster_spec.go, Lines 271 to 274 in 87c6775
[6] cockroach/build/teamcity/util/roachtest_util.sh, Lines 66 to 77 in 87c6775
[7] cockroach/pkg/cmd/roachtest/tests/ycsb.go, Lines 50 to 53 in 87c6775
[8] cockroach/pkg/cmd/roachtest/tests/restore.go, Lines 725 to 727 in 87c6775
Jira issue: CRDB-28322