
roachtest: move validate-after-version-upgrade to new framework #113643

Conversation

@rafiss (Collaborator) commented Nov 1, 2023

The new framework provides a few more testing enhancements and is the only one that will be maintained.

fixes #110535
Release note: None

@rafiss rafiss added the backport-23.2.x Flags PRs that need to be backported to 23.2. label Nov 1, 2023
@rafiss rafiss requested review from annrpom and a team November 1, 2023 22:16
@rafiss rafiss requested a review from a team as a code owner November 1, 2023 22:16
@rafiss rafiss requested review from herkolategan and renatolabs and removed request for a team November 1, 2023 22:16
@cockroach-teamcity (Member): This change is Reviewable

@renatolabs (Contributor)

I'm wondering -- could this test be written as a logictest instead? It seems like the kind of check that doesn't really need all the machinery of a real cluster.

If that's not an option, another alternative I'd suggest considering is to piggyback on acceptance/version-upgrade for non-local runs.

@rafiss (Collaborator, author) commented Nov 2, 2023

I didn't see a way to write this as a logictest when I considered it. The issue is that logictests don't have a way to wipe the cluster. This test starts a cluster on the latest version to get the system tables, then wipes the cluster, then starts it on the predecessor version and upgrades it.

That is actually somewhat of a "complex" operation, so I think it does make sense as a roachtest, which provides that flexibility. Does the version-upgrade test you mentioned do something similar?
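The sequence described above — bootstrap on the latest version, dump the system schema, wipe, bootstrap on the predecessor, upgrade, dump again — ends in a plain equivalence check between the two dumps. As a self-contained illustration of that final comparison step (this is not the roachtest's actual validateEquivalenceStep; the helper names and the whitespace normalization here are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSchema trims whitespace and drops blank lines so that
// cosmetic differences do not produce spurious mismatches.
func normalizeSchema(dump string) []string {
	var out []string
	for _, line := range strings.Split(dump, "\n") {
		line = strings.TrimSpace(line)
		if line != "" {
			out = append(out, line)
		}
	}
	return out
}

// schemasEquivalent reports whether two schema dumps match after
// normalization; if not, it describes the first difference found.
func schemasEquivalent(expected, actual string) (bool, string) {
	e, a := normalizeSchema(expected), normalizeSchema(actual)
	if len(e) != len(a) {
		return false, fmt.Sprintf("line count differs: %d vs %d", len(e), len(a))
	}
	for i := range e {
		if e[i] != a[i] {
			return false, fmt.Sprintf("mismatch at line %d: %q vs %q", i, e[i], a[i])
		}
	}
	return true, ""
}

func main() {
	expected := "CREATE TABLE t (\n  a INT8 NOT NULL\n);"
	actual := "CREATE TABLE t (\n  a INT8 NOT NULL\n);\n"
	ok, diff := schemasEquivalent(expected, actual)
	fmt.Println(ok, diff)
}
```

Judging from the diffs later in this thread, the dumps being compared are CREATE TABLE statements for the system tables, so a line-oriented comparison like the above is enough to surface column-ordering and default-expression discrepancies.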

@rafiss rafiss force-pushed the port-systemschema/validate-after-version-upgrade branch from bfd9e9e to 3a30edb on November 2, 2023 14:34
@Xiang-Gu (Contributor) left a comment

Much cleaner! Thanks!

@Xiang-Gu (Contributor) commented Nov 2, 2023

My understanding of acceptance/version-upgrade is that you can specify what to do upon startup and what you want to test in a mixed-version state, which is a very clean paradigm for testing various functionalities in a mixed-version state. Still, it didn't seem immediately obvious to me how to "start on the new version, do something, wipe out, restart on the old version, and upgrade to the new version" under this paradigm, which is what this new roachtest needed.

@renatolabs (Contributor)

The issue is that logictests don't have a way to wipe the cluster

That's true, but it also sounds like it wouldn't be too bad to implement (wipe is just kill + rm).

Another alternative to consider: how long does it take to run this test in local mode? Would it make sense to run it as part of the local acceptance roachtests that run in CI? It's simple enough that I believe it qualifies.

@renatolabs (Contributor)

It didn't seem immediately obvious to me how to "start on new version, do something, wipe out, restart on old version, and upgrade to new version" under this paradigm

We would just need to have the exact same logic to load the expected contents before we start the test, just like we do here.


// Compare the results.
validateEquivalenceStep(&expected, &actual),
binary := uploadVersion(ctx, t, c, c.All(), clusterupgrade.CurrentVersion())
A Contributor commented inline:

We can skip this step and use install.BinaryOption(test.DefaultCockroachPath) below.

The author (rafiss) replied:

done

expected = obtainSystemSchema(ctx, t.L(), c, 1)
c.Wipe(ctx, false /* preserveCerts */, c.All())

mvt := mixedversion.NewTest(ctx, t, t.L(), c, c.All(), mixedversion.NeverUseFixtures)
A Contributor commented inline:

  • Is this test incompatible with fixtures?
  • I believe you'll want mixedversion.NumUpgrades(1) here to ensure we only perform one upgrade in this test. Otherwise it might choose to perform several upgrades in the same test: it could go through 22.2 -> 23.1 -> master. AfterUpgradeFinalized is called after each upgrade.

The author (rafiss) replied:

Is this test incompatible with fixtures?

According to this comment, yes.

// NeverUseFixtures is an option that can be passed to `NewTest` to
// disable the use of fixtures in the test. Necessary if the test
// wants to use a number of cockroach nodes other than 4.

I believe you'll want mixedversion.NumUpgrades(1) here to ensure we only perform one upgrade in this test. Otherwise it might choose to perform several upgrades in the same test: it could go through 22.2 -> 23.1 -> master. AfterUpgradeFinalized is called after each upgrade.

Ah nice, thanks for the pointer.
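Putting the two points above together, the test construction would look roughly like this (a sketch combining the snippets quoted in this thread; option usage is inferred from the discussion and not verified against any particular revision of the mixedversion package):

```go
// Sketch: NeverUseFixtures is required because the test wipes and
// re-bootstraps the cluster itself; NumUpgrades(1) plans exactly one
// upgrade, so AfterUpgradeFinalized fires only once.
mvt := mixedversion.NewTest(
	ctx, t, t.L(), c, c.All(),
	mixedversion.NeverUseFixtures,
	mixedversion.NumUpgrades(1),
)
```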

@rafiss (Collaborator, author) commented Nov 2, 2023

That's true, but it also sounds like it wouldn't be too bad to implement (wipe is just kill + rm).

Hmm, but it requires a few changes to the cockroach-go/testserver API. I don't have the spare cycles to implement that myself, but I'm not opposed if you want to try adding it. Additionally, the mixed-version logictest framework doesn't have a way to tell it to start on a specific version, and adding that could make the directives/syntax a bit complicated.

Another alternative to consider: how long does it take to run this test in local mode? Would it make sense to run it as part of the local acceptance roachtests that run in CI? It's simple enough that I believe it qualifies.

Yeah a local version does make sense. I will work on that.

@rafiss rafiss force-pushed the port-systemschema/validate-after-version-upgrade branch 2 times, most recently from 477cdec to 9c7f2ef on November 2, 2023 17:43
@rafiss rafiss requested a review from renatolabs November 2, 2023 17:50
@rafiss rafiss force-pushed the port-systemschema/validate-after-version-upgrade branch from 9c7f2ef to a34dba9 on November 2, 2023 18:46
@renatolabs (Contributor)

I was running this test on my gceworker and it failed, and it seems to be failing on TC as well. Code looks good to me, not sure if it's a real bug or something is wrong in the test setup.

@rafiss (Collaborator, author) commented Nov 2, 2023

not sure if it's a real bug or something is wrong in the test setup.

I see this failure too. I am surprised by it. A diff like this makes it seem like something is really off:

 CREATE TABLE public.table_statistics (
   "tableID" INT8 NOT NULL,
   "statisticID" INT8 NOT NULL DEFAULT unique_rowid(),
   name STRING NULL,
   "columnIDs" INT8[] NOT NULL,
-  "createdAt" TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
+  "createdAt" TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMPTZ,
   "rowCount" INT8 NOT NULL,
   "distinctCount" INT8 NOT NULL,
   "nullCount" INT8 NOT NULL,
   histogram BYTES NULL,
   "avgSize" INT8 NOT NULL DEFAULT 0:::INT8,
@@ -271,11 +271,13 @@
   sampling_probability FLOAT8 NULL,
   plan_gist STRING NULL,
   anti_plan_gist BOOL NULL,
   CONSTRAINT "primary" PRIMARY KEY (id ASC),
   INDEX completed_idx_v2 (completed ASC, id ASC) STORING (statement_fingerprint, min_execution_latency, expires_at, sampling_probability, plan_gist, anti_plan_gist),
-  CONSTRAINT check_sampling_probability CHECK (sampling_probability BETWEEN 0.0:::FLOAT8 AND 1.0:::FLOAT8)
+  FAMILY "primary" (id, completed, statement_fingerprint, statement_diagnostics_id, requested_at, sampling_probability, plan_gist, anti_plan_gist),
+  FAMILY fam_6_min_execution_latency (min_execution_latency),
+  FAMILY fam_7_expires_at (expires_at)
 );

I will debug further.

@rafiss (Collaborator, author) commented Nov 2, 2023

It looks like the reason it's failing is due to issues with system schema migrations that predate this test. One example is dd7cbf2. In that commit, the total_consumption column was added to the end of the table in the migration, but in the bootstrap code it was added in the beginning. It results in this unexpected diff.

@@ -390,14 +392,14 @@
 	last_update TIMESTAMP NOT NULL,
 	ru_burst_limit FLOAT8 NULL,
 	ru_refill_rate FLOAT8 NULL,
 	ru_current FLOAT8 NULL,
 	current_share_sum FLOAT8 NULL,
-	total_consumption BYTES NULL,
 	instance_lease BYTES NULL,
 	instance_seq INT8 NULL,
 	instance_shares FLOAT8 NULL,
+	total_consumption BYTES NULL,
 	CONSTRAINT "primary" PRIMARY KEY (tenant_id ASC, instance_id ASC)
 );

But the confusing thing is that dd7cbf2 is added as part of the upgrade to v22.1.0. According to the test logs, this test started on v23.1.11... so I'm extremely confused why we'd see a v22.1.0 migration being executed.

@renatolabs (Contributor)

According to the test logs, this test started on v23.1.11

Not sure if it helps, but the cluster was bootstrapped on 22.2:

mixed-version test plan for upgrading from "v22.2.15" to "v23.1.11" to "113643":

@rafiss (Collaborator, author) commented Nov 2, 2023

Where can I find that? I don't see it in the artifacts at acceptance/version-upgrade/run_1/artifacts.zip!/test.log

@renatolabs (Contributor)

Ah, I was looking at the Local Roachtest (FIPS) build, but I guess the 22.2 bit is irrelevant then since the same failure happened on the non-FIPS build where we did start on 23.1.

@rafiss (Collaborator, author) commented Nov 2, 2023

I see. Well, regardless, if we want this test to be part of version-upgrade, we will need to tell the test to start from v22.2 or later (whether or not that's related to the issue we're seeing). And I think we may also need to use mixedversion.AlwaysUseLatestPredecessors. Is it still a good idea to have this test be a part of version-upgrade?

@renatolabs (Contributor)

Maybe the problem is that these tests are bootstrapping from fixtures, and the fixtures were created before the fixes you mentioned were merged?

Yes, given what we learned I think we should not be using acceptance/version-upgrade. I still think it makes sense to have this test only run in local mode. How about we make it an acceptance test?

func registerAcceptance(r registry.Registry) {

@rafiss rafiss force-pushed the port-systemschema/validate-after-version-upgrade branch from a34dba9 to 09b5d1b on November 2, 2023 22:12
@renatolabs (Contributor) left a comment

:lgtm:

I think this test should only run in local mode, but I'll address that together with a few other tests in follow-up work. Thanks for iterating on the approach!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @annrpom and @herkolategan)

@rafiss (Collaborator, author) commented Nov 3, 2023

thanks for the reviews!

bors r+

@craig (bot) commented Nov 3, 2023

Build failed:

{
name: "validate-system-schema-after-version-upgrade",
fn: runValidateSystemSchemaAfterVersionUpgrade,
timeout: 30 * time.Minute,
A Contributor commented inline:

I think we need defaultLeases: true here.

The author (rafiss) replied:

done
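With that change applied, the acceptance registration entry quoted above would presumably read as follows (a sketch extending the snippet in this thread; the field layout follows the quoted registration struct and is not verified against the actual source):

```go
{
	name:          "validate-system-schema-after-version-upgrade",
	fn:            runValidateSystemSchemaAfterVersionUpgrade,
	timeout:       30 * time.Minute,
	defaultLeases: true, // requested in review
},
```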

The new framework provides a few more testing enhancements and is the
only one that will be maintained.

It also adds the test to the suite of local acceptance tests that are
run in CI.

Release note: None
@rafiss rafiss force-pushed the port-systemschema/validate-after-version-upgrade branch from 09b5d1b to e76ddd3 on November 3, 2023 18:09
@rafiss (Collaborator, author) commented Nov 3, 2023

bors r+

@rafiss (Collaborator, author) commented Nov 3, 2023

bors r+

@craig (bot) commented Nov 3, 2023

Build succeeded:

@craig craig bot merged commit 17e08f8 into cockroachdb:master Nov 3, 2023
3 checks passed
@renatolabs (Contributor)

blathers backport 23.1

blathers-crl (bot) commented Nov 3, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e76ddd3 to blathers/backport-release-23.1-113643: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@renatolabs (Contributor)

We should probably backport to 23.1 too.

Labels: backport-23.2.x (Flags PRs that need to be backported to 23.2.)
Successfully merging this pull request may close: roachtest: port systemschema/validate-after-version-upgrade to new mixed-version framework
4 participants