release-22.2: roachtest: metamorphic ARM64 and FIPS clusters #104691

Merged
merged 2 commits into from
Jun 14, 2023

Conversation


@srosenberg srosenberg commented Jun 10, 2023

Backport 2/2 commits from #103710.

/cc @cockroachdb/release


Previously, all roachtests used (cloud) machine types with the AMD64 (cpu) architecture. Recently [1], new CI infrastructure was added to run a clone of all the nightly roachtests, configured with FIPS; i.e., same AMD64 machine types, different AMI and crdb binary, patched with FIPS-certified openssl native code.

As of this PR, any roachtest can be executed in a cluster configured with ARM64, FIPS, or AMD64 (default). This is controlled via two CLI args: metamorphic-arm64-probability and metamorphic-fips-probability. The former denotes the probability (over the uniform distribution) that a new cluster is provisioned using ARM64 VMs. The latter denotes the probability that a new AMD64 cluster is provisioned with the FIPS-compliant (kernel) configuration.
A test that is compatible only with AMD64 is effectively excluded from the set; i.e., both probabilities apply to compatible tests only.

Note that the two probabilities don't have to add up to 1. E.g., metamorphic-arm64-probability==0.4 and metamorphic-fips-probability==0.2 mean that ARM64 clusters are chosen ~40% of the time, whereas FIPS is chosen for ~20% of the remaining ~60% (AMD64) clusters; i.e., ~12% of all clusters will use FIPS.

Note that the values '0' and '1' are absolute. Setting both
to '0' is tantamount to the behavior before this PR.
Setting either to '1' ensures that every cluster
is provisioned with either ARM64 or FIPS.
A test can specify its required architecture, in which
case that requirement takes precedence over the metamorphic settings.

This PR builds on [1], which enabled ARM64 provisioning for AWS in roachprod. We add ARM64 provisioning for GCE, i.e., T2A, and refactor the 'arch' argument to
denote one of AMD64, ARM64, or FIPS. The last isn't formally a CPU architecture; however, treating it as one simplifies provisioning and binary staging.
We also modify roachprod.List to display CPU architecture, other than AMD64, with the machine type; this should make it easier to see which clusters are running ARM64 and FIPS configurations, as we ramp up their testing.

Epic: none
Release note: None
Release justification: ci/test only change

Resolves: #94957
Informs: #94986

[1] #99224
[2] #103243

Previously, roachtests which benchmark performance (cf. correctness)
were indistinguishable from correctness tests. That is, a performance
test is like any other test with the exception of _optionally_ writing
stats.json under 'Test.PerfArtifactsDir'; these artifacts are
automatically exported to a gcs bucket, used in conjunction with the
roachperf dashboard.

Having no direct way to distinguish a performance test from a correctness
test poses several challenges. E.g., performance tests may require a specific
machine type or architecture; background workloads like incremental backup
may cause a performance regression; new metamorphic configurations like
arm64 and fips may require a "bake-in" time before performance tests
can be enabled. In future, the test runner may make specialized
decisions (e.g., don't reuse a cluster) when executing a performance
test. Thus, we need a (standard) mechanism to enumerate all performance tests.
Given their specific requirements, the test author must explicitly opt in,
by setting TestSpec.Benchmark to 'true'.
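
As a rough illustration of the opt-in, the sketch below uses a pared-down, hypothetical `TestSpec` with only a name and the `Benchmark` flag; the real spec has more fields, and the `benchmarks` helper and the test names are made up for this example.

```go
package main

import "fmt"

// TestSpec is a hypothetical, reduced stand-in for the roachtest spec.
// Only the Benchmark field is taken from the description above.
type TestSpec struct {
	Name      string
	Benchmark bool // explicitly set to true by the author of a performance test
}

// benchmarks returns the names of all specs that opted in as performance tests.
func benchmarks(specs []TestSpec) []string {
	var names []string
	for _, s := range specs {
		if s.Benchmark {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	specs := []TestSpec{
		{Name: "example/perf", Benchmark: true},
		{Name: "example/correctness"}, // Benchmark defaults to false
	}
	fmt.Println(benchmarks(specs))
}
```

Since the zero value of the flag is false, a test counts as a performance test only when its author marks it, which gives the runner a single place to enumerate them.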

This PR applies the above change retroactively, i.e., setting 'TestSpec.Benchmark'
for all _known_ performance tests, including those which _assert_ on performance
instead of exporting stats.json.
It also fixes `roachtest list --bench` and `roachtest bench`, which were
out-of-date, albeit not actively used.

Epic: none
Release note: None
@srosenberg srosenberg requested review from a team as code owners June 10, 2023 01:14
@srosenberg srosenberg requested review from herkolategan and renatolabs and removed request for a team June 10, 2023 01:14

blathers-crl bot commented Jun 10, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@srosenberg srosenberg removed request for a team and renatolabs June 10, 2023 01:14
@cockroach-teamcity

This change is Reviewable

@srosenberg srosenberg requested a review from smg260 June 10, 2023 01:14
@srosenberg srosenberg force-pushed the backport22.2-103710 branch 2 times, most recently from 5abaa81 to c9b482a Compare June 10, 2023 16:52
Previously, all roachtests used (cloud) machine types
with the AMD64 (cpu) architecture. Recently [1], new
CI infrastructure was added to run a clone of all the
nightly roachtests, configured with FIPS; i.e., same
AMD64 machine types, different AMI and crdb binary,
patched with FIPS-certified openssl native code.

As of this PR, any roachtest can be executed in a
cluster configured with ARM64, FIPS, or AMD64
(default). This is controlled via two CLI args:
`metamorphic-arm64-probability` and
`metamorphic-fips-probability`. The former denotes
the probability (over the uniform distribution) that a new
cluster is provisioned using ARM64 VMs. The latter denotes
the probability that a new AMD64 cluster is provisioned
with the FIPS-compliant (kernel) configuration.
A test that is compatible only with AMD64 is
effectively excluded from the set; i.e., both
probabilities apply to compatible tests only.

Note that the two probabilities don't have to add up to 1.
E.g., `metamorphic-arm64-probability==0.4` and
`metamorphic-fips-probability==0.2` mean that ARM64
clusters are chosen ~40% of the time, whereas FIPS is
chosen for ~20% of the remaining ~60% (AMD64) clusters;
i.e., ~12% of all clusters will use FIPS.

Note that the values '0' and '1' are absolute. Setting both
to '0' is tantamount to the behavior before this PR.
Setting either to '1' ensures that _every_ cluster
is provisioned with either ARM64 or FIPS.
A test can specify its required architecture, in which
case that requirement takes precedence over the metamorphic settings.

This PR builds on [1], which enabled ARM64 provisioning
for AWS in roachprod. We add ARM64 provisioning for GCE,
i.e., T2A, and refactor the 'arch' argument to
denote one of AMD64, ARM64, or FIPS. The last
isn't formally a CPU architecture; however, treating it as
one simplifies provisioning and binary staging.
We also modify roachprod.List to display CPU architecture,
other than AMD64, with the machine type; this should make it
easier to see which clusters are running ARM64 and FIPS
configurations, as we ramp up their testing.

The PR also adds validation of cockroach binaries and libs
to ensure we can execute tests under ARM64 and FIPS.
Furthermore, we add an 'Enabled Assertions' header, generated
at build time, to the cockroach binary; the header is used
to validate whether or not the binary has runtime assertions
enabled.

Epic: none
Release note: None

Resolves: cockroachdb#94957
Resolves: cockroachdb#89268
Informs: cockroachdb#94986

[1] cockroachdb#99224
[2] cockroachdb#103243
@srosenberg srosenberg force-pushed the backport22.2-103710 branch from c9b482a to 1fa4fac Compare June 13, 2023 23:32
@srosenberg srosenberg merged commit 301a85b into cockroachdb:release-22.2 Jun 14, 2023