Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: run multiple backup types in backup/mixed-version #100755

Merged
merged 2 commits into from
Apr 28, 2023

Conversation

renatolabs
Copy link
Contributor

roachtest: run multiple backup types in backup/mixed-version

This adds support to database and full cluster backups in the
backup/mixed-version roachtest. When a cluster is taken in
mixed-version state, a randomly chosen backup type is performed. Since
cluster backups also backup some system tables, this commit also adds
logic related to verifying that the contents of system tables are
properly preserved when these cluster backups are restored.

roachtest: run tpcc workload in backup/mixed-versions

This commit adds a second workload to the backup/mixed-versions
test, namely tpcc with 100 warehouses. Previously, we would only run
the bank workload; by running TPCC, we verify backups of databases
that includes multiple tables, and the backup of tables that includes
foreign key constraints. It should also put the cluster under more
load than just bank, potentially exposing interesting edge cases
that wouldn't come up before.

Epic: CRDB-19321

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner April 5, 2023 21:09
@renatolabs renatolabs requested review from herkolategan and srosenberg and removed request for a team April 5, 2023 21:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/5-multiple-backup-types branch from f5a1c2d to effb462 Compare April 5, 2023 21:18
@renatolabs renatolabs changed the title roachtest: allow callers to cancel background steps in mixedversion roachtest: run multiple backup types in backup/mixed-version Apr 5, 2023
@renatolabs
Copy link
Contributor Author

renatolabs commented Apr 5, 2023

This is a stacked PR, please review them in order (but feel free to review the second or third ones before the first one is merged):

To be reviewed here: 2nd and 3rd commits.

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 83 at r3 (raw file):

	// are included as part of a full cluster backup. It should include
	// every table that opts-in to cluster backup (see `system_schema.go`).
	// It should be updated as system tables are added or removed from

Nit: I'm always a bit dubious of scenarios where somebody needs to remember to update something somewhere. I did note this comes from a ccl package though - GetSystemTablesToIncludeInClusterBackup- which definitely makes it harder to get to. I'm curious as I noticed cdc "breaks the rule" by importing from ccl. Would this also make the exceptions list?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 227 at r3 (raw file):

	tableContents interface {
		// Load computes the contents of a table with the given name
		// (passed as argument) and stores it representation internally.

Nit: 'its'


pkg/cmd/roachtest/tests/mixed_version_backup.go line 293 at r3 (raw file):

		nonce   string

		// tables is the list of tables that were are going to verify once

Nit: we are


pkg/cmd/roachtest/tests/mixed_version_backup.go line 330 at r3 (raw file):

//
//	tableNamesWithDB("db1", []string{"foo.bar, baz"})
//	=> []string{"db1.foo", "db1.baz"}

The code doesn't seem to match the example here.

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 229 at r1 (raw file):

		// cancel the underlying context.
		//
		// There should be one channel in `bgChans` for each hook in

Maybe make this stronger? E.g., "Invariant: there is exactly one channel in bgChans for each hook in Test.hooks.background."


pkg/cmd/roachtest/tests/backup.go line 166 at r1 (raw file):

	for j, ip := range ips {
		if err := retry.WithMaxAttempts(ctx, retryConfig, 10, func() error {
			addr := net.JoinHostPort(ip, fmt.Sprintf("%d", port))

Nice!


pkg/cmd/roachtest/tests/mixed_version_backup.go line 149 at r2 (raw file):

// tables, we need to quote the column names to avoid SQL syntax
// errors.
func quoteColumnsForSelect(columns []string) string {

Seems like a reusable (i.e., common) helper func.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 450 at r2 (raw file):

// Load computes the fingerprints for the underlying table and stores
// the contents in the `fingeprrints` field.

fingerprrints -> fingerprints

Why is this func called Load? It's not really loading data. Maybe rename to ComputeFingerprints?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 512 at r2 (raw file):

// in this system table row, in which case the error should be
// returned to the caller.
func (sr *systemTableRow) skip() bool {

skipColumn?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 773 at r2 (raw file):

	}

	sc.rows = rowSet

Curious how large of a memory footprint do we expect?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1101 at r2 (raw file):

			if strings.HasPrefix(table, "system.") {
				node := h.RandomNode(rng, mvb.roachNodes)
				contents, err = newSystemTableContents(ctx, mvb.cluster, node, db, table, timestamp)

Remind me why we can't fingerprint system tables?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1439 at r2 (raw file):

// resetCluster wipes the entire cluster and starts it again with the
// current latest binary. This is done prior we attempt restoring a

prior -> before


pkg/cmd/roachtest/tests/mixed_version_backup.go line 330 at r3 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

The code doesn't seem to match the example here.

It does, but it's rather strange. We drop foo on the floor, which is the strange part, i.e., counter-intuitive.

@renatolabs renatolabs force-pushed the rc/5-multiple-backup-types branch from 9dd1c3c to 90553ff Compare April 26, 2023 14:03
@renatolabs
Copy link
Contributor Author

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 229 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Maybe make this stronger? E.g., "Invariant: there is exactly one channel in bgChans for each hook in Test.hooks.background."

I had slightly updated this comment (Herko had a similar suggestion). Updated to include the "Invariant" part too.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 149 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Seems like a reusable (i.e., common) helper func.

Yes, which is why this function is not tied to any structs used by this test. Since all roachtests share the same package at the moment, they can use this function. Or did you mean I should move the function to an util.go file to make its reusability more explicit?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 450 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

fingerprrints -> fingerprints

Why is this func called Load? It's not really loading data. Maybe rename to ComputeFingerprints?

Typo fixed.

As for the Load function: this is implementing an interface. I chose Load in the interface to be the more generic term to describe what happens across user and system tables. IMO, a valid interpretation of this function is that we are loading their contents in different ways: for user tables, the representation of the loaded data is a series of fingerprints. For system tables, we do load the entire contents of the table (as JSON encoded arrays -- implementation detail). I tried to convey that in the comments, but let me know if there's more I can do to make the point clearer.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 512 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

skipColumn?

While currently this function is only called in the context of processColumn, it doesn't need to be so. skip indicates that any operation on systemTableRow should be skipped (if there was a function that processed entire rows at a time, they should also be skipped if skip() returns true.)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 773 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Curious how large of a memory footprint do we expect?

Not much at all; a lot of these system tables are either empty or have very few rows in them (order of couple dozen, max).


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1101 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Remind me why we can't fingerprint system tables?

Because the test is running an upgrade, and then all bets are off. System tables could have new columns, new rows, etc... Even if there wasn't an upgrade going on, restore logic for system tables can sometimes be quite custom, including "rekeying" of certain keys, which make direct comparison as provided by fingerprint matching not possible. Not to mention impossible to avoid (at roachtest level) races for scheduled jobs. There's probably more reasons I'm not remembering.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 1439 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

prior -> before

Fixed, thanks.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 83 at r3 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: I'm always a bit dubious of scenarios where somebody needs to remember to update something somewhere. I did note this comes from a ccl package though - GetSystemTablesToIncludeInClusterBackup- which definitely makes it harder to get to. I'm curious as I noticed cdc "breaks the rule" by importing from ccl. Would this also make the exceptions list?

I kind of like the explicitness of having the entire list spelled out in the test.

To verify the contents of system tables, especially in the context of an upgrade, the test is required to be aware of certain special behavior associated with the restoring of these tables (such as ignoring theversion row in system.settings, for example). In addition, we also need to ignore the system.zones table at the moment (see associated comment in this file). So even if we imported the list of names from the backup package, we'd still need some manually written logic to deal with these special cases. And then at that point, it's arguable whether the benefit is still there. If, at some point, we stop backing up the system.settings table, as an arbitrary example, the test would continue to pass but we now have dead code in the test dealing with special cases in the restore that no longer apply.

If there are changes in the system tables we backup, I'd see a failure in this test as a "feature"; we would look at the failure and acknowledge it's expected. The fix would likely be straightforward.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 227 at r3 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: 'its'

Fixed, thanks.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 293 at r3 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: we are

Fixed, thanks.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 330 at r3 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It does, but it's rather strange. We drop foo on the floor, which is the strange part, i.e., counter-intuitive.

Yeah, it does match the example. I updated the documentation for this function, let me know if it's clearer now.

@renatolabs
Copy link
Contributor Author

Pushed a temporary commit (from #101786) to see if this fixes a deadlock in workload I'm seeing when running this test. Will drop the commit before merge.

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 330 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Yeah, it does match the example. I updated the documentation for this function, let me know if it's clearer now.

Thanks, but what I meant, as with the original example, is that the input tableNamesWithDB("restored_db", []string{"bank.bank, stock"}) seems to suggest that it can support a comma separated string as well as a slice.
In other words shouldn't it be tableNamesWithDB("restored_db", []string{"bank.bank", "stock"})? In the current state the example would output: "restored_db.bank, stock"

@renatolabs renatolabs force-pushed the rc/5-multiple-backup-types branch from 8e27836 to a08ea52 Compare April 27, 2023 13:59
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 330 at r3 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Thanks, but what I meant, as with the original example, is that the input tableNamesWithDB("restored_db", []string{"bank.bank, stock"}) seems to suggest that it can support a comma separated string as well as a slice.
In other words shouldn't it be tableNamesWithDB("restored_db", []string{"bank.bank", "stock"})? In the current state the example would output: "restored_db.bank, stock"

Oh, I see, I didn't notice that! I often make that typo, thanks for catching. Fixed.

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary commit dropped, this is back to what should be merged.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

@herkolategan herkolategan self-requested a review April 27, 2023 15:35
Copy link
Contributor

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 9 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 220 at r4 (raw file):

	// tableContents is an interface to be implemented by different
	// methods of storing and comparing the contents of a table. A lot
	// of backup-related tests use the `EXPERIMENTAL_FINGEPRINTS`

typo: EXPERIMENTAL_FINGE[R]PRINTS


pkg/cmd/roachtest/tests/mixed_version_backup.go line 239 at r4 (raw file):

	// fingerprintContents implements the `tableContents` interface
	// using the `EXPERIMENTAL_FINGEPRINTS` functionality. This should

typo: as above

@smg260
Copy link
Contributor

smg260 commented Apr 27, 2023

Is there a better way to organise the code in mixed_version_backup.go, which is getting really long? Maybe it all makes sense together. Perhaps separate helpers from the meat of the test?

This adds support to database and full cluster backups in the
`backup/mixed-version` roachtest. When a cluster is taken in
mixed-version state, a randomly chosen backup type is performed. Since
cluster backups also backup some system tables, this commit also adds
logic related to verifying that the contents of system tables are
properly preserved when these cluster backups are restored.

Epic: CRDB-19321

Release note: None
This commit adds a second workload to the `backup/mixed-versions`
test, namely `tpcc` with 100 warehouses. Previously, we would only run
the `bank` workload; by running TPCC, we verify backups of databases
that includes multiple tables, and the backup of tables that includes
foreign key constraints. It should also put the cluster under more
load than just `bank`, potentially exposing interesting edge cases
that wouldn't come up before.

Epic: CRDB-19321

Release note: None
@renatolabs renatolabs force-pushed the rc/5-multiple-backup-types branch from a08ea52 to 932b02b Compare April 27, 2023 19:28
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 220 at r4 (raw file):

Previously, smg260 (Miral Gadani) wrote…

typo: EXPERIMENTAL_FINGE[R]PRINTS

Fixed, thanks.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 239 at r4 (raw file):

Previously, smg260 (Miral Gadani) wrote…

typo: as above

Fixed, thanks.

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it is becoming quite long. I'm not sure how much can be moved to a different file, especially given the current roachtest organization where every test is under the same package (does it add more confusion if we split a test in multiple files?).

I've talked to DR about making some of the functionality in this test more generally applicable to non-mixed-version backup/restore tests. I think it makes sense to defer any moving of code to different files at that point.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)

@renatolabs
Copy link
Contributor Author

I expect this to flake due to connection warmup hanging (to be fixed by #101786). I'm going to merge it anyway so that we get some runs in; flakes should be pretty easy to identify.

TFTRs!

bors r=herkolategan

@craig
Copy link
Contributor

craig bot commented Apr 28, 2023

Build succeeded:

@craig craig bot merged commit e45aac9 into cockroachdb:master Apr 28, 2023
@renatolabs renatolabs deleted the rc/5-multiple-backup-types branch April 28, 2023 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants