Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release: audit unit/integration upgrade tests #100552

Closed
srosenberg opened this issue Apr 4, 2023 · 11 comments
Closed

release: audit unit/integration upgrade tests #100552

srosenberg opened this issue Apr 4, 2023 · 11 comments
Labels
branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker

Comments

@srosenberg
Copy link
Member

srosenberg commented Apr 4, 2023

All unit tests which invoke MakeTestingClusterSettingsWithVersions and pass a non-default binaryMinSupportedVersion (i.e., anything other than clusterversion.TestingBinaryMinSupportedVersion) should be audited. The reason is explained in detail in [1].

TL;DR Before [1], the test cluster would be bootstrapped at the current version unless the test passed BootstrapVersionKeyOverride. Thus, migrations were not being tested other than their idempotency. After [1], the test cluster is correctly bootstrapped at the default binaryMinSupportedVersion unless the test overrides it. This roughly translates to,

NOTE: the above only concerns the unit tests. Roachtests use a different bootstrapping mechanism which ensures an upgrade test is always bootstrapped at some previous version.

  • Node is bootstrapped and joins the test cluster at the default binaryMinSupportedVersion
  • StartServer uses s.BinaryVersionOverride() to SET CLUSTER SETTING version which triggers migrations from the bootstrapped version to the (overridden) BinaryVersion

Below are all unit tests, grouped by package, which should be audited. This set is conservative. The ones which say FAIL pass a non-default binaryMinSupportedVersion, but may still be correct. The ones which don't say FAIL are likely to be correct, but it wouldn't hurt to double-check them.

pkg/ccl/kvccl/kvtenantccl

@cockroachdb/multi-tenant

TestTenantUpgradeFailure
TestTenantUpgradeInterlock

pkg/ccl/serverccl

@cockroachdb/multi-tenant

TestServerStartupGuardrails (FAIL)
TestBumpTenantClusterVersion (FAIL)
TestValidateTargetTenantClusterVersion (FAIL)

pkg/kv/kvclient/kvcoord

TestBiDirectionalRangefeedNotUsedUntilUpgradeFinalilzed (FAIL)

pkg/kv/kvserver

TestMigrateWaitsForApplication (FAIL)
TestLeaseUpgradeVersionGate (FAIL)
TestRangeMigration (FAIL)
TestLoadBasedRebalancingObjective
TestRebalanceObjectiveManager
TestStoreConfig

pkg/kv/kvserver/batcheval

TestDeclareKeysResolveIntent

pkg/server/settingswatcher

TestVersionGuard

pkg/storage

TestPebbleIterator_ExternalCorruption #100592

pkg/upgrade/upgrademanager

@cockroachdb/sql-schema and @cockroachdb/jobs

TestAlreadyRunningJobsAreHandledProperly - #100830
TestConcurrentMigrationAttempts - #100873
TestMigrateUpdatesReplicaVersion - #100873
TestPauseMigration - #100830

pkg/upgrade/upgrades

@cockroachdb/sql-schema

TestDatabaseRoleSettingsUserIDMigration1500Users (FAIL) seems like a mistake that this is in this list
TestDeleteDescriptorsOfDroppedFunctions (FAIL) #100726
TestExternalConnectionsUserIDMigration10Users (FAIL) seems like a mistake that this is in this list
TestFixUserfileRelatedDescriptorCorruptionUpgrade (FAIL) #100726
TestPreconditionBeforeStartingAnUpgrade (FAIL) #100726
TestMigrationWithFailuresMultipleAltersOnSameColumn fixed by #100644
TestSystemJobInfoMigration #100726 #100696
TestSystemPrivilegesIndexMigration seems like a mistake that this is in this list
TestSystemActivityMigration #100726
TestWaitForDelRangeInGCJob (FAIL) #100726
TestWebSessionsUserIDMigrationNoUsers seems like a mistake that this is in this list

pkg/ccl/backupccl/datadriven_test.go

@cockroachdb/disaster-recovery

testdata/backup-restore/restore-schema-only-mixed-version:new-cluster name=s1 beforeVersion=Start22_2 disable-tenant #100696
testdata/backup-restore/in-progress-import-rollback:new-cluster name=s1 beforeVersion=23_1_MVCCTombstones disable-tenant #100696
testdata/backup-restore/restore-mixed-version:new-cluster name=s1 beforeVersion=Start22_2 disable-tenant #100696
testdata/backup-restore/restore-mixed-version:new-cluster name=s2 beforeVersion=Start22_2 share-io-dir=s1 disable-tenant #100696

[1] #99082

Jira issue: CRDB-26487

@srosenberg srosenberg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 4, 2023
@srosenberg
Copy link
Member Author

srosenberg commented Apr 4, 2023

@cockroachdb/multi-tenant, @cockroachdb/sql-schema, @cockroachdb/storage, @cockroachdb/kv-prs, @cockroachdb/server-prs,@cockroachdb/disaster-recovery

@rafiss
Copy link
Collaborator

rafiss commented Apr 4, 2023

TestDatabaseRoleSettingsUserIDMigration1500Users is tagged, but it does not call MakeTestingClusterSettingsWithVersions. It does set this though:

				Server: &server.TestingKnobs{
					DisableAutomaticVersionUpgrade: make(chan struct{}),
					BootstrapVersionKeyOverride:    clusterversion.V22_2,
					BinaryVersionOverride:          clusterversion.ByKey(clusterversion.V23_1DatabaseRoleSettingsHasRoleIDColumn - 1),
				},

But so does TestSystemPrivilegesUserIDMigration1500Users, and that's not in this list.

So I was curious how the list was generated?

jbowens added a commit to jbowens/cockroach that referenced this issue Apr 4, 2023
Remove an unnecessary use of cluster.MakeTestingClusterSettingsWithVersions in
favor of cluster.MakeTestingClusterSettings.

Epic: None
Informs: cockroachdb#100552
Release note: None
@knz knz added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 blocks-23.1.0-beta.1 labels Apr 4, 2023
craig bot pushed a commit that referenced this issue Apr 4, 2023
100539: sql: add telemetry for UDFs with RETURNS TABLE r=mgartner a=mgartner

Informs #100226

Release note: None

100554: kv: initialize consistencyLimiter during Store construction, before Start r=aliher1911 a=nvanbenschoten

Fixes #96231.

This commit attempts to fix #96231. It moves the initialization of `Store.consistencyLimiter` up from the bottom of `Store.Start` into `NewStore`. It is unsafe to initialize this variable after the call to `Store.processRaft`, which starts Raft processing. Beyond that point, incoming Raft traffic is permitted and calls to `computeChecksumPostApply` are possible.

The two stacktraces we have in that issue conclusively point to the `Store.consistencyLimiter` being nil during a call to `(*Replica).computeChecksumPostApply`. This startup-time race is the only possible explanation I could come up with.

Release note (bug fix): Fixed a rare startup race that could cause an `invalid memory address or nil pointer dereference` error.

100592: storage: remove unnecessary version override in unit test r=nicktrav a=jbowens

Remove an unnecessary use of cluster.MakeTestingClusterSettingsWithVersions in favor of cluster.MakeTestingClusterSettings.

Epic: None
Informs: #100552
Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@srosenberg
Copy link
Member Author

srosenberg commented Apr 5, 2023

TestDatabaseRoleSettingsUserIDMigration1500Users is tagged, but it does not call MakeTestingClusterSettingsWithVersions.

It used to before this change [1]. When I ran this last night, my local release-23.1 was a few days behind.

[1] 16ff79d#diff-581dd19761966c928337b36%5B%E2%80%A6%5D174592c84a4c44a699816111b8c5

@aliher1911
Copy link
Contributor

TestRangeMigration - is ok, it is using made up versions to verify that below raft migration infrastructure
TestLeaseUpgradeVersionGate - is ok as it only needs version to test version gates
TestMigrateWaitsForApplication - is ok as it is testing migration machinery, not actions.
TestBiDirectionalRangefeedNotUsedUntilUpgradeFinalilzed - is ok, we only need version gate regardless of data.

What I noticed when checking tests is that docs on BinaryVersionOverride explain intricacies of forcing versions settings and bootstrap override while it would be more helpful to explain how to achieve certain goal. e.g. start test cluster with data at old version to test migration, how to start cluster to validate stateless version gate without paying price of actual migration etc.
I think audited tests could have BinaryVersionOverride removed without any ill effects, but I'm not 100% sure and I'm confident they currently test what's expected.

adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 5, 2023
This test has been rewritten to explicitly indicate
what version the test cluster is being bootstrapped
and upgraded to.

Informs: cockroachdb#100552
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 5, 2023
The userfile descriptor corruption version gate can
be safely deleted as all clusters upgrading to 23.1+
are guaranteed to have been upgraded to 22.2 prior to
that.

Informs: cockroachdb#100552
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 5, 2023
This version gate was for clusters that were
not fully upgraded to 22.2. Clusters that upgrade
to 23.1+ are guaranteed to have run this migration
and so it is safe to delete.

Informs: cockroachdb#100552
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 5, 2023
This change bumps the mixed version restore test
to use the current minimum binary version instead of
an older V22_2Start gate that will soon be deleted.

Informs: cockroachdb#100552
Release note: None
rafiss added a commit to rafiss/cockroach that referenced this issue Apr 5, 2023
As described in cockroachdb#100552, it's important for this API to use
TestingBinaryMinSupportedVersion in order to correctly bootstrap on the
older version.

Removed TestFixUserfileRelatedDescriptorCorruptionUpgrade and TestPreconditionBeforeStartingAnUpgrade
since they are for 22.2 migrations.

Release note: None
@msbutler
Copy link
Collaborator

msbutler commented Apr 5, 2023

@srosenberg @adityamaru @rafiss I spent a bit of time trawling though this doc explaining the problem and the lengthy comment on testingKnobs.BinaryVersionOverride, and I'm still not totally clear what the best practice is for mixed version unit test writers. Here's how I guess we ought initialize a test cluster:

tc := testcluster.StartTestCluster(t, 1, base.TestClusterArgs{
		ServerArgs: base.TestServerArgs{
			Knobs: base.TestingKnobs{
				Server: &server.TestingKnobs{
					DisableAutomaticVersionUpgrade: make(chan struct{}),
					BinaryVersionOverride:          clusterversion.TestingBinaryMinSupportedVersion,
					BootstrapVersionKeyOverride:    clusterversion.BinaryMinSupportedVersionKey,
				},
			},
		},
	})

In other words, a unit test writer needs to specify both the binaryVersionOverride and the BootstrapVersionKeyOverride. I'm inferring this from @adityamaru 's TODO:

// TODO(adityamaru): We should force tests that set BinaryVersionOverride to
// also set BootstrapVersionKeyOverride so as to specify what image they would
// like the cluster bootstrapped at before upgrading to BinaryVersionOverride.

Could somebody clarify and then update the issue header?

adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 6, 2023
This change bumps the mixed version restore test
to use the current minimum binary version instead of
an older V22_2Start gate that will soon be deleted.

Informs: cockroachdb#100552
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 6, 2023
This change bumps the mixed version restore test
to use the current minimum binary version instead of
an older V22_2Start gate that will soon be deleted.

Informs: cockroachdb#100552
Release note: None
craig bot pushed a commit that referenced this issue Apr 6, 2023
99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz

Epic: CRDB-23559
Fixes #92523.

All commits but the last are from #100436.

This change ensures that tenant servers managed by the server
controller receive a graceful drain request as part of the graceful
drain process of the surrounding KV node.

This change, in turn, ensures that SQL clients connected to these
secondary tenant servers benefit from the same guarantees (and
graceful periods) as clients to the system tenant.


100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss

As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version.

informs #100552 
Release note: None

100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei

Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is.

Release note: None
Epic: None

100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek

In #90599 adjustments where made to the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to avoid pruning columns which might be needed when
deriving new predicates based on foreign key constraints for lookup join.

However, this caused a problem where rules might sometimes fire in an
infinite loop because the same columns to prune keep getting added as
PruneCols in calls to DerivePruneCols. The logic in prune_cols.opt and
DerivePruneCols must be kept in sync to avoid such problems, and this
PR brings it back in sync.

Epic: none
Fixes: #100478

Release note: None

100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens

Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank.

Epic: None
Informs: #97705.
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
craig bot pushed a commit that referenced this issue Apr 6, 2023
99288: cdc: add apache arrow parquet library and writer r=miretskiy a=jayshrivastava

#### cdc: add apache arrow parquet library
This commit installs the apache arrow parquet library for Go
at version 11. The release can be found here:
https://github.com/apache/arrow/releases/tag/go%2Fv11.0.0

This library is licensed under the Apache License 2.0.

Informs: #99028
Epic: None
Release note: None

---
#### util/parquet: create parquet writer library
This change implements a `Writer` struct in the new `util/parquet` package.
This `Writer` writes datums to the `io.Writer` sink
using a configurable parquet version (defaults to v2.6).


The package implements several features internally required to write in the parquet format:
- schema creation
- row group / column page management
- encoding/decoding of CRDB datums to parquet datums
Currently, the writer only supports types found in the TPCC workload, namely INT, DECIMAL, STRING
UUID, TIMESTAMP and BOOL.

This change also adds a benchmark and tests which verify the correctness of the
writer and test utils for reading datums from parquet files.

Informs: #99028
Epic: None
Release note: None

--- 
#### changefeedccl: add parquet writer 
This change adds the file `parquet.go` which contains
helper functions to help create parquet writers
and export data via `cdcevent.Row` structs.

This change also adds tests to ensure rows are written
to parquet files correctly.

Epic: None
Release note: None

100830: upgrademanager: fix upgrade manager tests that relied on wrong invariants r=knz a=adityamaru

The two tests in question were low level tests that upgrade clusters from mock version 41 to 42 and override the upgrade flow to test the internals of the upgrade manager. These mock cluster versions were not tied to our real cluster versions and so were not offset with the `DevOffset` when a branch is marked as a development branch. For this reason when a branch was a developmentBranch they would sort under all the "real" cluster versions while when a branch was marked as a release branch they would sort above all the "real" cluster versions. This undeterministic behaviour did not play well with certain job migrations that were added this release.

This change rewrites the test to use real cluster versions so that their values are in sync with whether the branch is a developmentBranch or not. This change also removes some overrides to make the test more intuitive. Cocnretely, the test servers now bootstrap at the minimum supported binary version and run all the upgrades until the `startCV` before executing the body of the test.

Fixes: #100685
Informs: #100552

Release note: None

Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 6, 2023
…ants

The two tests in question were low level tests that upgrade clusters
from mock version 41 to 42 and override the upgrade flow to
test the internals of the upgrade manager. These mock cluster versions
were not tied to our real cluster versions and so were not offset
with the `DevOffset` when a branch is marked as a development branch.
For this reason when a branch was a developmentBranch they would sort
under all the "real" cluster versions while when a branch was marked
as a release branch they would sort above all the "real" cluster versions.
This undeterministic behaviour did not play well with certain job migrations
that were added this release.

This change rewrites the test to use real cluster versions so that their
values are in sync with whether the branch is a developmentBranch or not.
This change also removes some overrides to make the test more intuitive.
Cocnretely, the test servers now bootstrap at the minimum supported binary version
and run all the upgrades until the `startCV` before executing the body of the
test.

Fixes: #100685
Informs: #100552

Release note: None
@knz knz added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 and removed branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Apr 6, 2023
rafiss added a commit to rafiss/cockroach that referenced this issue Apr 7, 2023
As described in cockroachdb#100552, it's important for this API to use
TestingBinaryMinSupportedVersion in order to correctly bootstrap on the
older version.

Removed TestFixUserfileRelatedDescriptorCorruptionUpgrade and TestPreconditionBeforeStartingAnUpgrade
since they are for 22.2 migrations.

Release note: None
craig bot pushed a commit that referenced this issue Apr 7, 2023
100873: upgrademanager: audit tests that use MakeTestingClusterSettingsWithVersions r=rafiss a=rafiss

informs #100552

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
@renatolabs renatolabs added GA-blocker and removed blocks-23.1.0-beta.1 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 7, 2023
@renatolabs
Copy link
Contributor

Trying to find out what is left to be done here.

@cockroachdb/multi-tenant Have you had the chance to take a look at the tests under pkg/ccl/kvccl/kvtenantccl, pkg/ccl/serverccl, and pkg/server/settingswatcher?

@cockroachdb/storage have you had a chance to look at TestPebbleIterator_ExternalCorruption?

@nicktrav
Copy link
Collaborator

The Storage test was updated last week. I've updated the summary with a link to the patch.

@knz
Copy link
Contributor

knz commented Apr 14, 2023

I checked TestTenantUpgradeInterlock.

For the others we're waiting for @healthy-pod and @ajstorm to be back next week.

@renatolabs
Copy link
Contributor

Everything has been addressed, closing the issue. Thanks everyone for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker
Projects
None yet
Development

No branches or pull requests

8 participants