delete pod to stop a node instead of scaling sts #15214
Conversation
// Keep deleting the pod if it recovers before the deadline
while Instant::now() < deadline {
    match self.wait_until_healthy(deadline).await {
Waiting till healthy will make the node catch up to the latest state, which is something we don't want to do. Can we wait till the pod is running instead?
Doesn't that match the existing behavior? These tests were using Node::start(), which calls wait_until_healthy().
I updated this to check the pod status instead.
No, before we had stop() -> sleep -> start() -> healthy. Now, kill() -> healthy() -> kill() -> healthy() -> ... right?
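For reference, a minimal sketch of what "wait until the pod is Running" could look like with kube-rs, polling the pod phase rather than the node's health endpoint. The helper name, polling interval, and error handling here are illustrative assumptions, not the PR's actual code:

```rust
use std::time::{Duration, Instant};

use anyhow::bail;
use k8s_openapi::api::core::v1::Pod;
use kube::Api;

// Hypothetical helper: poll the pod phase instead of waiting for the node to
// report healthy, so the node does not need to catch up to the latest state
// before we consider the pod "back".
async fn wait_until_pod_running(
    pods: &Api<Pod>,
    name: &str,
    deadline: Instant,
) -> anyhow::Result<()> {
    while Instant::now() < deadline {
        if let Ok(pod) = pods.get(name).await {
            let phase = pod.status.as_ref().and_then(|s| s.phase.as_deref());
            if phase == Some("Running") {
                return Ok(());
            }
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    bail!("pod {name} did not reach Running before the deadline")
}
```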
Looks like these tests are still timing out with these changes. I'm going to try increasing the timeout.
This didn't work. We're still seeing delays in pod startup due to reattaching PVCs. I increased the timeouts in #15244 to get the Forge tests working again. We can explore using node affinities to avoid pods getting moved between nodes.
Description
Add a method to temporarily stop a fullnode/validator node by repeatedly deleting the pod. This is used for the fullnode/validator stress tests, where we want to keep the underlying node allocated to the StatefulSet. We suspect that the previous method of scaling the StatefulSet was causing node allocation delays and causing these tests to time out. A sketch of the approach is shown below.
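For illustration, a minimal sketch of the repeated-delete approach using kube-rs. The helper name `stop_pod_for`, the retry interval, and the error handling are assumptions for this sketch, not the PR's actual implementation:

```rust
use std::time::{Duration, Instant};

use k8s_openapi::api::core::v1::Pod;
use kube::{api::DeleteParams, Api};

// Hypothetical helper: keep the pod down for `duration` by deleting it each
// time the StatefulSet controller recreates it. The StatefulSet itself is
// never scaled down, so the underlying node stays allocated to it, which
// scaling to zero replicas was suspected of disrupting.
async fn stop_pod_for(pods: &Api<Pod>, name: &str, duration: Duration) -> anyhow::Result<()> {
    let deadline = Instant::now() + duration;
    while Instant::now() < deadline {
        // Ignore errors such as "not found": the pod may already be terminating.
        let _ = pods.delete(name, &DeleteParams::default()).await;
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
    Ok(())
}
```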
How Has This Been Tested?
Ran the adhoc forge workflow
Key Areas to Review
Type of Change
Which Components or Systems Does This Change Impact?
Checklist