-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: update mixed-version backup to use new framework #96991
roachtest: update mixed-version backup to use new framework #96991
Conversation
This is part of a stacked PR comprised of 3 parts. Please review them in order.
To be reviewed in this PR: last commit. |
d54678f
to
8a51dc5
Compare
8a51dc5
to
d9a1302
Compare
Update:
To be reviewed: still just the last commit. |
d9a1302
to
b6cdc9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 9 files at r2, 10 of 10 files at r5, 5 of 5 files at r6, 2 of 4 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/tests/backup.go
line 153 at r7 (raw file):
csvPort := 8081 csvCmd := importBankCSVServerCommand("./cockroach", csvPort) c.Run(ctx, c.All(), csvCmd+` &> logs/workload-csv-server.log < /dev/null &`)
This seems flaky. If the ssh session dies for any reason, the CSV server would be killed. I think we either need to nohup
and disown
or use systemd-run
, e.g.,
sudo systemd-run -p LimitNOFILE=65535 --unit workload-csv-server -- $csvCmd
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 46 at r7 (raw file):
var ( invalidVersionRE = regexp.MustCompile(`[^a-zA-Z0-9.]`)
Technically it's correct, but \.
is more consistent.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 58 at r7 (raw file):
) func sanitizeVersion(v string) string {
Not immediately clear why version needs to be sanitized. Maybe rename to sanitizeVersionForBackup
.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 242 at r7 (raw file):
return fmt.Errorf("error while running csv server: %w", err) }) time.Sleep(time.Second) // wait for csv server to open listener
What happens if it doesn't start in time? E.g., let's say it's a scheduling chaos.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
// in mixed-binary state, four backup collections are created: (see // four cases in the function body). If all nodes are running the next // version (which is different from the cluster version), then a
Not entirely following why the next version must be different from the cluster version when a cluster is finalized?
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 727 at r7 (raw file):
mvt := mixedversion.NewTest(ctx, t, t.L(), c, roachNodes) mvb := newMixedVersionBackup(ctx, c, roachNodes, "bank.bank") mvt.OnStartup("set short job interval", mvb.setShortJobIntervals)
This seems like an implementation detail that should perhaps be hidden from the (framework) end-user?
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 732 at r7 (raw file):
mvt.InMixedVersion("plan and run backups", mvb.planAndRunBackups) mvt.AfterUpgradeFinalized("verify backups", mvb.verifyBackups)
Might be worth a comment to explain why backups cannot be verified before finalization. It's because we don't support restores during the mixed-version state, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/backup.go
line 108 at r7 (raw file):
// shared across the `backup/mixed-version` and // `declarative_schema_changer/job-compatibility-mixed-version`. Delete // this duplicated code once both tests have been migrated to the new
May better be documented as a TODO://
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 119 at r7 (raw file):
} func newEncryptionPassphrase(rng *rand.Rand) encryptionPassphrase {
doesn't look like rng
is used
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 200 at r7 (raw file):
// to be observed faster, and the test to run quicker. func (*mixedVersionBackup) setShortJobIntervals( l *logger.Logger, rng *rand.Rand, h *mixedversion.Helper,
nit l
is unused
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 354 at r7 (raw file):
func (mvb *mixedVersionBackup) saveFingerprint( l *logger.Logger, rng *rand.Rand,
nit Some functions use prng
, some like this use rng
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 559 at r7 (raw file):
// enableJobAdoption (re-)enables job adoption on the given nodes. func (mvb *mixedVersionBackup) enableJobAdoption( l *logger.Logger, nodes option.NodeListOption, h *mixedversion.Helper,
nit h
is unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 125 at r7 (raw file):
pass := make([]byte, passLen) for i := range pass { pass[i] = alphabet[rand.Intn(len(alphabet))]
rng
was most likely intended to be used here.
e953deb
to
13461a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/backup.go
line 108 at r7 (raw file):
Previously, smg260 (Miral Gadani) wrote…
May better be documented as a TODO://
Done.
pkg/cmd/roachtest/tests/backup.go
line 153 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This seems flaky. If the ssh session dies for any reason, the CSV server would be killed. I think we either need to
nohup
anddisown
or usesystemd-run
, e.g.,sudo systemd-run -p LimitNOFILE=65535 --unit workload-csv-server -- $csvCmd
I'm leaning towards leaving this as-is. We have lots of roachtests that run workload
throughout the entire duration of the test, and I can't remember instances of the test flaking because the session died. Surprisingly, it seems most of our SSH-related problems come from short-lived processes/commands. I'm willing to come back to this if we see a change, but my experience so far, having run this test dozens of times at this point, is that this is not a source of flakiness (and hasn't been before these changes, as this code was already here).
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 46 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Technically it's correct, but
\.
is more consistent.
Done.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 58 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Not immediately clear why version needs to be sanitized. Maybe rename to
sanitizeVersionForBackup
.
Function renamed, comment added.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 119 at r7 (raw file):
Previously, smg260 (Miral Gadani) wrote…
doesn't look like
rng
is used
Fixed.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 125 at r7 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
rng
was most likely intended to be used here.
Heh, this comment arrived a little too late, I had already fixed it locally. I spent at least 30 minutes trying to find out why two runs with the same seed didn't produce the exact same test, and this was the reason. We definitely need a "unused arg" linter 😅
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 200 at r7 (raw file):
Previously, smg260 (Miral Gadani) wrote…
nit
l
is unused
This one is here because the function needs to conform to the userFunc
interface.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 242 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
What happens if it doesn't start in time? E.g., let's say it's a scheduling chaos.
Even though I also haven't seen this fail before, I changed it to something that should be more reliable (or at least fail with a clear error message on timeout).
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 354 at r7 (raw file):
Previously, smg260 (Miral Gadani) wrote…
nit Some functions use
prng
, some like this userng
Yeah, I did use prng
on the mixedversion
package. I'll make a note to self to be consistent when I make changes to mixedversion
next but I don't want to make those renames in this PR.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 559 at r7 (raw file):
Previously, smg260 (Miral Gadani) wrote…
nit
h
is unused
Removed.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Not entirely following why the next version must be different from the cluster version when a cluster is finalized?
The cluster should never be finalized when this function is called. What the comment is saying is that the function may be called in mixed-binary state (some nodes run binary vN, others run vN-1); or when all nodes run the same binary version but that may not be the same as the cluster version (because the upgrade is definitely not finalized at this point).
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 727 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This seems like an implementation detail that should perhaps be hidden from the (framework) end-user?
I think it does belong to this test. It's fiddling with a setting for the purposes of making this test run in a reasonable amount of time, so it shouldn't be part of the framework itself. But maybe I misunderstand what you're suggesting, let me know.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 732 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Might be worth a comment to explain why backups cannot be verified before finalization. It's because we don't support restores during the mixed-version state, right?
Eh, so technically only cluster restores are not allowed in mixed-version state so theoretically we could be running these restores in mixed-version. It's an arguable feature IMO and there's been discussions of whether we should continue supporting it.
In any case, this is in the list of things I'm planning to address in an upcoming PR (4th and hopefully last in this series).
Update:
To be reviewed: all changes in the diff. |
Pinging @cockroachdb/disaster-recovery for awareness (welcome any input you may have as well). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 125 at r7 (raw file):
We definitely need a "unused arg" linter
I forget... why is staticcheck
not flagging it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 125 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
We definitely need a "unused arg" linter
I forget... why is
staticcheck
not flagging it?
Turns out staticcheck
never had that functionality; I had this idea that at some point we were flagging unused args, but I guess we weren't. I imagine it's hard to reduce noise in such a linter due to how interfaces work in Go. There's some discussion in dominikh/go-tools#124.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 9 files at r2, 1 of 10 files at r5, 5 of 5 files at r6, 4 of 4 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 200 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This one is here because the function needs to conform to the
userFunc
interface.
ah! sorry i should have navigated to the interface - hard to do in reviewable
13461a3
to
0b7da89
Compare
Fixed conflicts with master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! The new spec. for the mixed-version backup is more succinct and intuitive. It also covers more ground as evidenced by a newly found issue [1].
[1] #97953
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/tests/backup.go
line 153 at r7 (raw file):
I can't remember instances of the test flaking because the session died.
Right, we would have seen it, otherwise. Maybe the session lasts long enough for the import to complete? I'll take a look via roachprod.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 125 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Turns out
staticcheck
never had that functionality; I had this idea that at some point we were flagging unused args, but I guess we weren't. I imagine it's hard to reduce noise in such a linter due to how interfaces work in Go. There's some discussion in dominikh/go-tools#124.
Thanks for the reference; will take a look. It might be worth a quicksemgrep
investigation to see how many false positives would arise.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 242 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Even though I also haven't seen this fail before, I changed it to something that should be more reliable (or at least fail with a clear error message on timeout).
👍
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
The cluster should never be finalized when this function is called. What the comment is saying is that the function may be called in mixed-binary state (some nodes run binary vN, others run vN-1); or when all nodes run the same binary version but that may not be the same as the cluster version (because the upgrade is definitely not finalized at this point).
I suppose I must have missed the important part of the comment, namely InMixedVersion
. It would be nice to assert on that state. More generally, there will be other functions which are intended to be executed only in mixed or final state. We can't really rely on comments to prevent the user error.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 727 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I think it does belong to this test. It's fiddling with a setting for the purposes of making this test run in a reasonable amount of time, so it shouldn't be part of the framework itself. But maybe I misunderstand what you're suggesting, let me know.
Right, I am not actually sure what I had in mind at the time of the above comment. It does appear like a bit of fiddling, but it's specific to this test, so I don't see how this could be hidden from the user. Maybe there is a way to simplify this for other backup-related tests by moving the setup steps into a helper func.
This updates the `backup/mixed-version` roachtest to use the recently introduced mixed-version roachtest framework (`mixedversion` package). The main behavior exercised remains the same: backups are taken in mixed-binary state, and those backups are restored and verified at the end of the test. However, this commit also improves the coverage of mixed-version backup testing in a few ways: * **Randomization**. By virtue of using the new framework, most runs will be different from one another since the order of actions taken by the test will be different. Previously, backups would always be taken with 2 nodes in the old version and 2 nodes in the new version. Now, backups can be taken when an arbitrary number of nodes is running the new version. As a consequence, it's also possible that some executions will attempt backups when all nodes are running a new binary version, but the cluster version itself has not been updated. Other points of new randomization include the choice of the node's external dir where backups are stored, which node to connect to when running certain statements, and how much to wait between backups. * **Backup Options**. Backups will randomly be created with `revision_history` enabled, or with an `encryption_passphrase`. * **Downgrades**. The cluster is also downgraded in mixed-version tests. No downgrades happened in that test before this commit. * **Workload**. Instead of using fixed call to `generate_series` to generate data between backups, the test now runs the `bank` workload continuously during the test. A random wait between backups allows the workload to make changes to the underlying table during the test and for the backups to be taken while writes are taking place. * **Finalization**: the test _may_ attempt to create a backup as the upgrade is finalizing (i.e., migrations are running and cluster version is advancing). In addition, this test will also see improved coverage as we make more improvements to test plans generated by the `mixedversion` package. These changes will create more backup scenarios in the future without requiring any code changes to this test. This test has already helped us uncover one backup bug (cockroachdb#97953). Epic: CRDB-19321 Release note: None
0b7da89
to
080a469
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)
a discussion (no related file):
Fixed conflicts with master again.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I suppose I must have missed the important part of the comment, namely
InMixedVersion
. It would be nice to assert on that state. More generally, there will be other functions which are intended to be executed only in mixed or final state. We can't really rely on comments to prevent the user error.
Can you explain a little further the kind of assertions you are suggesting? I'm still unsure what you are referring to when you say we are relying on comments to avoid user-error.
To provide a little more context, as I think this is related, we tend to use the term mixed-version
to mean different things: mixed-binary version, and unfinalized clusters (binary version != cluster version). The closure passed to InMixedVersion
can be in either of those states. That was an intentional decision, as it's beneficial for us to be able to make the same assertions in both cases.
I've considered exposing an API that would allow these closures to specify they only want to run in one case or the other, but I've been avoiding it for as long as I can to reduce the chances of us unintentionally not covering both situations.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 727 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Right, I am not actually sure what I had in mind at the time of the above comment. It does appear like a bit of fiddling, but it's specific to this test, so I don't see how this could be hidden from the user. Maybe there is a way to simplify this for other backup-related tests by moving the setup steps into a helper func.
The setShortJobIntervals
function is already sharing the bulk of its logic with the standard backup
roachtest. The only part that is different is how we connect to the nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can you explain a little further the kind of assertions you are suggesting? I'm still unsure what you are referring to when you say we are relying on comments to avoid user-error.
To provide a little more context, as I think this is related, we tend to use the term
mixed-version
to mean different things: mixed-binary version, and unfinalized clusters (binary version != cluster version). The closure passed toInMixedVersion
can be in either of those states. That was an intentional decision, as it's beneficial for us to be able to make the same assertions in both cases.I've considered exposing an API that would allow these closures to specify they only want to run in one case or the other, but I've been avoiding it for as long as I can to reduce the chances of us unintentionally not covering both situations.
I'm going to merge the PR, but feel free to respond here and I'm happy to address concerns in an upcoming PR.
bors r=srosenberg TFTR! |
👎 Rejected by code reviews |
bors r=srosenberg |
@smg260 I think I addressed all your comments so I took the liberty of removing your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Can you explain a little further the kind of assertions you are suggesting?
A runtime assertion that would fail whenever a func is being executed in the wrong state, e.g., mixed instead of initial or final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 588 at r7 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Can you explain a little further the kind of assertions you are suggesting?
A runtime assertion that would fail whenever a func is being executed in the wrong state, e.g., mixed instead of initial or final.
Are you envisioning a scenario where the functions created in mixed_version_backup.go
are used in a different context, potentially one where there's no mixed-version state? That sounds unlikely to me, given that the caller would have to instantiate a mixedVersionBackupTest
struct.
Build failed: |
bors retry |
Build succeeded: |
This updates the
backup/mixed-version
roachtest to use the recentlyintroduced mixed-version roachtest framework (
mixedversion
package).The main behavior exercised remains the same: backups are taken in
mixed-binary state, and those backups are restored and verified at the
end of the test. However, this commit also improves the coverage of
mixed-version backup testing in a few ways:
Randomization. By virtue of using the new framework, most runs
will be different from one another since the order of actions taken by
the test will be different. Previously, backups would always be taken
with 2 nodes in the old version and 2 nodes in the new version. Now,
backups can be taken when an arbitrary number of nodes is running the
new version. As a consequence, it's also possible that some executions
will attempt backups when all nodes are running a new binary version,
but the cluster version itself has not been updated. Other points of
new randomization include the choice of the node's external dir where
backups are stored, which node to connect to when running certain
statements, and how much to wait between backups.
Backup Options. Backups will randomly be created with
revision_history
enabled, or with anencryption_passphrase
.Downgrades. The cluster is also downgraded in mixed-version
tests. No downgrades happened in that test before this commit.
Workload. Instead of using fixed call to
generate_series
togenerate data between backups, the test now runs the
bank
workloadcontinuously during the test. A random wait between backups allows the
workload to make changes to the underlying table during the test and
for the backups to be taken while writes are taking place.
Finalization: the test may attempt to create a backup as the
upgrade is finalizing (i.e., migrations are running and cluster
version is advancing).
In addition, this test will also see improved coverage as we make more
improvements to test plans generated by the
mixedversion
package.These changes will create more backup scenarios in the future without
requiring any code changes to this test.
This test has already helped us uncover one backup bug (#97953).
Epic: CRDB-19321
Release note: None