Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: ClusterSpec should support user-specified clouds _compatible_ with a test #104029

Closed
srosenberg opened this issue May 29, 2023 · 6 comments · Fixed by #113505
Closed
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@srosenberg
Copy link
Member

srosenberg commented May 29, 2023

Currently, roachtests don't have an established mechanism for specifying a set of cloud providers which are compatible with a given test. Theoretically, a roachtest should be cloud-agnostic since it doesn't directly interact with cloud APIs, a task that's delegated to roachprod. In practice, several roachtests may in fact be incompatible with a set of cloud providers. E.g.,

  • schemachange/mixed-versions-compat uses gsutil to copy corpus data from a GCS bucket [1]
  • restore variant uses AWS-specific zones [2]
  • variants of clearrange and yscb/A use ZFS, available only in GCE [3], [4], [5]

Note, while gsutil in the first example may seem like a superficial incompatibility, in reality large-scale backup/restore tests may induce large egress if data is pulled from a cloud provider, different from where the test is scheduled to execute. Hence, we need to ensure, either data is sufficiently replicated (i.e., local to the test's cloud provider), or the test is specified to be incompatible with the cloud providers which lack the required test data (fixtures).
In practice, there may be additional, albeit rare reasons for incompatibility; e.g., quota, price, specific machine type, etc.

Consequently, there must be an established mechanism, both for specifying when a test is incompatible with a cloud provider, and for skipping the test from executing against all incompatible cloud providers. Currently, ClusterSpec.Cloud denotes the cloud provider that's been provided via roachtest run --cloud, not a compatible cloud provider, specified by the test author. That is, the framework makes an implicit (and wrong) assumption that every test should be executable against ClusterSpec.Cloud. As a further confounding factor, CI uses tags to select roachtests per given cloud provider [6].

As for skipping incompatible tests, test authors came up with ad hoc workarounds, e.g., [7], [8], thereby complicating both the setup logic, as well as, future refactoring.

[1]

err = c.RunE(ctx, c.Node(1),
fmt.Sprintf(" gsutil cp gs://cockroach-corpus/corpus-%s/corpus %s",
version,
corpusFilePath))

[2]
hardware: makeHardwareSpecs(hardwareSpecs{
nodes: 9,
zones: []string{"us-east-2b", "us-west-2b", "eu-west-1b"}}), // These zones are AWS-specific.
backup: makeBackupSpecs(backupSpecs{}),
timeout: 90 * time.Minute,
tags: registry.Tags("aws"),

[3]
Cluster: r.MakeClusterSpec(10, spec.CPU(16), spec.SetFileSystem(spec.Zfs)),

[4]
Cluster: r.MakeClusterSpec(4, spec.CPU(cpus), spec.SetFileSystem(spec.Zfs)),

[5]
if s.Cloud != GCE {
return vm.CreateOpts{}, nil, errors.Errorf(
"node creation with zfs file system not yet supported on %s", s.Cloud,
)

[6]
gce)
# Confusing due to how we've handled tags in the past where it has been assumed that all tests should
# be run on GCE. Now with refactoring of how tags are handled, we need:
# - "default" to ensure we select tests that don't have any user specified tags (preserve old behavior)
# - "aws" to ensure we select tests that now no longer have "default" because they have the "aws" tag
# Ideally, refactor the tags themselves to be explicit about what cloud they are for and when they can run.
# https://github.com/cockroachdb/cockroach/issues/100605
FILTER="tag:aws tag:default"
;;
aws)
if [ -z "${FILTER}" ]; then
FILTER="tag:aws"

[7]
// For now, we only want to run the zfs tests on GCE, since only GCE supports
// starting roachprod instances on zfs.
if c.Spec().FileSystem == spec.Zfs && c.Spec().Cloud != spec.GCE {
t.Skip("YCSB zfs benchmark can only be run on GCE", "")

[8]
if rd.c.Spec().Cloud != rd.sp.backup.cloud {
// For now, only run the test on the cloud provider that also stores the backup.
rd.t.Skip("test configured to run on %s", rd.sp.backup.cloud)

Jira issue: CRDB-28322

@srosenberg srosenberg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels May 29, 2023
@blathers-crl
Copy link

blathers-crl bot commented May 29, 2023

cc @cockroachdb/test-eng

@srosenberg
Copy link
Member Author

srosenberg commented May 29, 2023

Tagging (pun intended) relevant issues,

Based on an earlier discussion within TestEng, tags shouldn't be used for capturing (compatibility) semantics; they are merely metadata for labeling/organizing (e.g., grouping) roachtests.

renatolabs added a commit to renatolabs/cockroach that referenced this issue Aug 24, 2023
roachtest will exit with code 11 if creating any clusters during a
test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary
unavailability of a resource type in a data center; and partly because
of issues in roachtest itself (see cockroachdb#104029).
* Exiting with code 11 causes the build to be marked and reported as a
failrue on TeamCity/Slack and that's disruptive. We already get
cluster creation failure notifications on GitHub. By reporting them as
build failures on TeamCity, we mask actually serious issues like the
test runner crashing in the middle of the build and not running every
test (for a recent example, see cockroachdb#109279).

For these reasons, this commit updates the script used by TeamCity to
invoke roachtest to also ignore exit code 11 (just like it currently
does for exit code 10). This makes roachtest build failures stand out
more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Aug 25, 2023
109463: build/roachtest: do not exit with code 11 on cluster creation failure r=srosenberg,herkolategan a=renatolabs

roachtest will exit with code 11 if creating any clusters during a test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary unavailability of a resource type in a data center; and partly because of issues in roachtest itself (see #104029).
* Exiting with code 11 causes the build to be marked and reported as a failrue on TeamCity/Slack and that's disruptive. We already get cluster creation failure notifications on GitHub. By reporting them as build failures on TeamCity, we mask actually serious issues like the test runner crashing in the middle of the build and not running every test (for a recent example, see #109279).

For these reasons, this commit updates the script used by TeamCity to invoke roachtest to also ignore exit code 11 (just like it currently does for exit code 10). This makes roachtest build failures stand out more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None

Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Aug 25, 2023
roachtest will exit with code 11 if creating any clusters during a
test run failed. However, that is not ideal for a few reasons:

* Cluster creation often fails, partly because of temporary
unavailability of a resource type in a data center; and partly because
of issues in roachtest itself (see #104029).
* Exiting with code 11 causes the build to be marked and reported as a
failrue on TeamCity/Slack and that's disruptive. We already get
cluster creation failure notifications on GitHub. By reporting them as
build failures on TeamCity, we mask actually serious issues like the
test runner crashing in the middle of the build and not running every
test (for a recent example, see #109279).

For these reasons, this commit updates the script used by TeamCity to
invoke roachtest to also ignore exit code 11 (just like it currently
does for exit code 10). This makes roachtest build failures stand out
more, as they will mean roachtest was unable to run all tests.

Epic: none

Release note: None
@RaduBerinde
Copy link
Member

I think that if there is a separate "cloud compatibility" field in the test spec (as proposed in #100605), the ClusterSpec in the test registry should not have a Cloud field at all. Most of the parameters should apply to any cloud. The ClusterSpec can still contain cloud-specific parameters (for one or more clouds), with the logic: "when cluster is on cloud X, use these cloud-specific options".

The "cloud compatibility" field could also live inside CluserSpec (replacing Cloud), but it feels more like a top-level test property given that we want run to automatically filter tests in a suite based on the cloud.

@srosenberg
Copy link
Member Author

I think that if there is a separate "cloud compatibility" field in the test spec (as proposed in #100605), the ClusterSpec in the test registry should not have a Cloud field at all.

Logically, it makes sense. However, we may need to refactor clusterImpl; it's currently agnostic to TestSpec, but it needs to differentiate between local and actual cloud, in some cases.

@RaduBerinde
Copy link
Member

RaduBerinde commented Sep 14, 2023

Another super confusing thing is that the registry "leaks" the cloud that we are using through the ClusterSpec and some registration code uses it to tweak the test settings depending on the cloud:

cloud := r.MakeClusterSpec(1).Cloud

EstimatedMax: gceOrAws(cloud, 750, 900),

This is crazy; test registration should not depend on the flags. If we need different parameters, we should simply define two different tests (or have the test decide on the parameter once it's running).

@RaduBerinde
Copy link
Member

Yes. As a first step, we should make sure none of the code registering tests looks at the .Cloud field. We should be able to do that once we have a cloud compatibility field.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 26, 2023
This commit cleans up the tpcc code to not look at the cloud (leaked
through `TestSpec`) during registration. Instead, we define both GCE
and AWS values in the spec and decide between them when the test is
run.

Informs cockroachdb#104029
Release note: None
@RaduBerinde RaduBerinde self-assigned this Sep 26, 2023
craig bot pushed a commit that referenced this issue Sep 27, 2023
111285: roachtest: tpcc: don't look at cloud during registration r=RaduBerinde a=RaduBerinde

This commit cleans up the tpcc code to not look at the cloud (leaked through `TestSpec`) during registration. Instead, we define both GCE and AWS values in the spec and decide between them when the test is run.

Informs #104029
Release note: None

Co-authored-by: Radu Berinde <[email protected]>
michae2 pushed a commit to michae2/cockroach that referenced this issue Sep 27, 2023
This commit cleans up the tpcc code to not look at the cloud (leaked
through `TestSpec`) during registration. Instead, we define both GCE
and AWS values in the spec and decide between them when the test is
run.

Informs cockroachdb#104029
Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 30, 2023
This change removes all remaining uses of `ClusterSpec.Cloud` except
those internal to roachtest. Code that is part of running a test now
uses `Cluster.Cloud()` instead.

Informs: cockroachdb#104029
Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Oct 3, 2023
This change removes all remaining uses of `ClusterSpec.Cloud` except
those internal to roachtest. Code that is part of running a test now
uses `Cluster.Cloud()` instead.

Informs: cockroachdb#104029
Release note: None
craig bot pushed a commit that referenced this issue Oct 15, 2023
111811: spec: move machine type, zone, local ssd defaults out of the TestSpec and the registry r=RaduBerinde a=RaduBerinde

This set of commits makes more progress towards #104029 and - more generally - not baking in any flag configuration into the registry itself.

#### spec: move RoachprodOpts args to separate struct

Epic: none
Release note: None

#### spec: move default machine type from ClusterSpec to RoachprodClusterConfig

This is much more logical and allows the removal of the parameter from
the registry.

Epic: none
Release note: None

#### spec: move default zones to RoachprodClusterConfig

Remove the default zones from `ClusterSpec` (and the registry), and
add it to `RoachprodClusterConfig`.

Epic: none
Release note: None

#### spec: move local SSD preference to RoachprodClusterConfig

We change the boolean in the TestSpec to a tri-state (default, prefer
on, disable). This way we can apply the default when creating the
cluster.

Epic: none
Release note: None

Co-authored-by: Radu Berinde <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Oct 17, 2023
This commit cleans up the tpcc code to not look at the cloud (leaked
through `TestSpec`) during registration. Instead, we define both GCE
and AWS values in the spec and decide between them when the test is
run.

Informs cockroachdb#104029
Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Nov 1, 2023
This change is the last step in removing runtime state from the test
registry and the cluster spec. The cloud is no longer accessible at
test registration time.

Fixes cockroachdb#104029
Release note: None
craig bot pushed a commit that referenced this issue Nov 8, 2023
113301: roachtest: allow tests to specify a cockroach binary to use  r=renatolabs,RaduBerinde a=DarrylWong

Currently, roachtests must manually upload their own cockroach binary if needed through the Put API.
However, almost all roachtests upload the standard t.Cockroach() binary to ./cockroach on all nodes,
resulting in the same Put code being duplicated at the start of most tests. Additionally, to collect artifacts
we still need a cockroach binary at a discoverable path, leading to the same binary being copied twice in many
cases, see: #97814

This change adds a TestSpec option which lets tests specify a cockroach binary to use. If one is not specified,
we now upload the t.Cockroach() binary to ./cockroach. This lets us remove cockroach-default logic for artifacts,
and removes the need to manually upload binaries at the start of each test.

Release note: None
Fixes: #104729

113505: roachtest: remove cloud from registry and ClusterSpec r=RaduBerinde a=RaduBerinde

This change is the last step in removing runtime state from the test
registry and the cluster spec. The cloud is no longer accessible at
test registration time.

Fixes #104029
Release note: None

Co-authored-by: DarrylWong <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig craig bot closed this as completed in c6a8e62 Nov 8, 2023
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Nov 15, 2023
This change is the last step in removing runtime state from the test
registry and the cluster spec. The cloud is no longer accessible at
test registration time.

Fixes cockroachdb#104029
Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Dec 1, 2023
This change removes all remaining uses of `ClusterSpec.Cloud` except
those internal to roachtest. Code that is part of running a test now
uses `Cluster.Cloud()` instead.

Informs: cockroachdb#104029
Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Jan 10, 2024
This change is the last step in removing runtime state from the test
registry and the cluster spec. The cloud is no longer accessible at
test registration time.

Fixes cockroachdb#104029
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants