Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add node-kill operation #122899

Merged
merged 1 commit into from
May 1, 2024
Merged

Conversation

itsbilal
Copy link
Member

This change adds a node-kill operation with 4 variants: one that drains and one that doesn't x {SIGKILL, SIGTERM}.

Epic: none

Release note: None

@itsbilal itsbilal requested a review from herkolategan April 23, 2024 14:36
@itsbilal itsbilal requested a review from a team as a code owner April 23, 2024 14:36
@itsbilal itsbilal requested review from renatolabs and removed request for a team April 23, 2024 14:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few small comments, mostly roachtest convenience APIs related. One general question, It seems that clean-up is responsible for the timing regarding when the node is started back up again. Would there be value in having nodes be down for a specified (random) amount of time, rather than a fixed amount of time?

pkg/cmd/roachtest/operations/node_kill.go Outdated Show resolved Hide resolved
if err != nil {
o.Fatal(err)
}
err = c.RunE(ctx, option.WithNodes(c.Node(nid)),
Copy link
Collaborator

@herkolategan herkolategan Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@renatolabs may know how to structure this better, but we could instead of doing an ExternalPGUrl use the new command util to optionally add insecure or certs to drain:

Something like this (untested):

cmd := roachtestutil.NewCommand(fmt.Sprintf("%s node drain", test.DefaultCockroachPath)).
	Flag("logtostderr", "INFO").
	Flag("port", fmt.Sprintf("{pgport:%d}", nid)).
	MaybeOption(!c.IsSecure(),"insecure").
	MaybeFlag(c.IsSecure(),"certs", install.CockroachNodeCertsDir).
	Option("self")
err = c.RunE(ctx, option.WithNodes(c.Node(nid)), cmd.String())

Also not sure if secure will be correct here, it's normally passed in (stateless)

pkg/cmd/roachtest/operations/node_kill.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/operations/node_kill.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/operations/node_kill.go Show resolved Hide resolved
pkg/cmd/roachtest/operations/node_kill.go Show resolved Hide resolved
pkg/cmd/roachtest/operations/node_kill.go Show resolved Hide resolved
pkg/cmd/roachtest/operations/node_kill.go Show resolved Hide resolved
@itsbilal itsbilal force-pushed the rt-op-node-kill branch 2 times, most recently from 01461fe to c042dd0 Compare April 30, 2024 21:32
@itsbilal
Copy link
Member Author

Thanks for the review! Addressed the points brought up.

Would there be value in having nodes be down for a specified (random) amount of time, rather than a fixed amount of time?

This is a good point, and something I'd want drt-run (which I have been working on again lately) to handle. Currently run-operation --wait-before-cleanup <duration> configures it. Because yes, we'd ideally want a more interesting interleaving of operation-induced states.

@itsbilal itsbilal force-pushed the rt-op-node-kill branch 2 times, most recently from 9fae50a to c0acd40 Compare May 1, 2024 13:48
@itsbilal itsbilal force-pushed the rt-op-node-kill branch from c0acd40 to db28f30 Compare May 1, 2024 16:03
@itsbilal
Copy link
Member Author

itsbilal commented May 1, 2024

TFTR!

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented May 1, 2024

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented May 1, 2024

Build failed (retrying...):

@yuzefovich
Copy link
Member

The linter is not happy.

bors r-

@craig
Copy link
Contributor

craig bot commented May 1, 2024

Canceled.

@itsbilal
Copy link
Member Author

itsbilal commented May 1, 2024

Apologies, didn't realize the linter isn't under Bazel essential CI anymore.

This change adds a node-kill operation with 4 variants:
one that drains and one that doesn't x {SIGKILL, SIGTERM}.

Epic: none

Release note: None
@itsbilal itsbilal force-pushed the rt-op-node-kill branch from db28f30 to 385e94c Compare May 1, 2024 20:44
@itsbilal
Copy link
Member Author

itsbilal commented May 1, 2024

Required checks are green now.

bors r=renatolabs

@craig craig bot merged commit f8b33a7 into cockroachdb:master May 1, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants