cli: job restart command #16278
Force-pushed from ec53b40 to a085f14
Implement the new `nomad job restart` command that allows operators to restart allocations' tasks or reschedule the entire allocation. Restarts can be batched to target multiple allocations in parallel. Between each batch the command can stop and hold for a predefined time or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected, so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour.
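For illustration, an invocation exercising the batching and pause behaviour described above might look like this (the job name `example` is a placeholder; the flags are the ones discussed later in this PR):

```
$ nomad job restart -batch-size=2 -batch-wait=ask -reschedule example
```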
Force-pushed from a085f14 to fe21b04
Sorry, WIP. Will return to it.
The previous implementation would only monitor the eval resulting from the alloc stop action. But when multiple allocations for the same task group are stopped at the same time (which can happen with `-batch-size` > 1) and the replacement allocations fail to be placed, only one of the evals is blocked and includes the failed placement metrics. The second eval is just cancelled without any reference to the blocked eval, causing the goroutine to be stuck waiting for a replacement alloc and preventing the whole command from exiting.

The new implementation monitors all evals for the job until a blocked eval is found that contains failed placement metrics for the alloc task group, or a replacement alloc succeeds.
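A rough sketch of the shape of that monitoring loop, using the Go API client (the helper name `waitForReplacementOrBlockedEval` is hypothetical; error handling, blocking queries, and timeouts are elided, and the real logic lives in the command):

```go
import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForReplacementOrBlockedEval polls all evals for the job until a blocked
// eval reports failed placements for the task group, or the stopped alloc has
// a replacement. Hypothetical helper, not the PR code.
func waitForReplacementOrBlockedEval(client *api.Client, jobID, taskGroup, allocID string) error {
	for {
		evals, _, err := client.Jobs().Evaluations(jobID, nil)
		if err != nil {
			return err
		}
		for _, eval := range evals {
			// A blocked eval carrying failed placement metrics for our
			// task group means the replacement could not be placed.
			if eval.Status == "blocked" && eval.FailedTGAllocs[taskGroup] != nil {
				return fmt.Errorf("failed to place replacement for group %q", taskGroup)
			}
		}
		alloc, _, err := client.Allocations().Info(allocID, nil)
		if err != nil {
			return err
		}
		if alloc.NextAllocation != "" {
			return nil // a replacement exists; it can be followed from here
		}
		time.Sleep(time.Second)
	}
}
```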
Fixed by adding a safety margin.
@tgross @schmichael this PR is ready for another round of review. There were a few changes from the last round that, unfortunately, didn't fall neatly into separate commits, but these were the latest changes:
Here's the diff of the new commits:
```go
// The flags -batch-wait and -on-error have an 'ask' option. This function
// handles both to prevent asking the user twice in case they are both set to
// 'ask' and an error happens.
func (c *JobRestartCommand) shouldProceed(err error) bool {
```
I couldn't find a way to test this function since it requires user input, and messing with `os.Stdin` during tests seemed dangerous 😬
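One possible approach, sketched here under the assumption that the command's `Ui` comes from `github.com/mitchellh/cli`: a `cli.MockUi` can stand in for stdin via its `InputReader` field, so no real terminal is involved. The test below is illustrative, not part of the PR:

```go
import (
	"errors"
	"strings"
	"testing"

	"github.com/mitchellh/cli"
)

// Feed a canned answer through cli.MockUi instead of touching os.Stdin.
// Assumes JobRestartCommand's Meta exposes the Ui field, as other Nomad
// command tests do.
func TestShouldProceed_AnswersYes(t *testing.T) {
	ui := cli.NewMockUi()
	ui.InputReader = strings.NewReader("y\n") // simulated user typing "y"

	c := &JobRestartCommand{Meta: Meta{Ui: ui}}
	if !c.shouldProceed(errors.New("boom")) {
		t.Fatal("expected shouldProceed to return true when the user answers 'y'")
	}
}
```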
LGTM! This whole feature is a masterclass in the difference between incidental complexity and intrinsic complexity. 😀
I've left a fairly long comment around what happens during deployments, but that's a one-liner to fix so I'm 👍 on shipping once that's handled.
```go
}()

for {
	answer, err := c.Ui.Ask(c.Colorize().Color(prefixedQuestion))
```
While doing some testing I noticed that `Ask` has some weird behavior around buffering where some errant keypresses after Ctrl-C got dropped into the input. I'm not sure if we can actually do anything about it here in Nomad, but just leaving a note in case we encounter it again:
```
2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced
^C^[[A^[[A^[^[==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
Are you sure you want to stop the restart process? [y/N] y
Invalid answer "\x1b[A\x1b[A\x1b\x1by"
==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
Are you sure you want to stop the restart process? [y/N] y
```
Ah interesting! We could try to remove signal characters from the input `Ask` gives us, but the list of signals and their values may change from system to system. It may also be possible to `Ignore` all other signals while an input is being provided, but that can be pretty annoying since it breaks the expected flow of signal handling. All of this to say that I agree there's not much we can do here...
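For reference, a minimal sketch of the first idea: strip non-printable runes from the answer before validating it. This only removes the raw ESC (`\x1b`) bytes seen above, not the printable remainder of an escape sequence, which is part of why a complete fix is hard (illustrative code, not from the PR):

```go
import (
	"strings"
	"unicode"
)

// sanitizeAnswer drops non-printable runes, such as the ESC bytes that
// leaked into the input above, before the answer is validated.
func sanitizeAnswer(answer string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPrint(r) {
			return r
		}
		return -1 // -1 tells strings.Map to drop the rune
	}, answer)
}
```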
```go
// Use prefix matching to find job.
job, err := c.JobByPrefix(c.client, c.jobID, nil)
if err != nil {
	c.Ui.Error(err.Error())
	return 1
}
```
During testing I discovered that passing `-reschedule` while there's an ongoing deployment pretty much always results in this command hanging. I used a job with a canary and manual promotion in its `update` block.

First attempt was running the restart while the deployment was waiting for manual promotion:
```
$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T09:55:39-04:00: Restarting 7 allocations
    2023-03-17T09:55:39-04:00: Rescheduling allocation "3958e815" for group "group"
^C==> 2023-03-17T09:57:36-04:00: Restart interrupted, no more allocations will be restarted.
Are you sure you want to stop the restart process? [y/N]
    2023-03-17T09:57:39-04:00: Still waiting for allocation "3958e815" to be replaced
    2023-03-17T09:58:39-04:00: Still waiting for allocation "3958e815" to be replaced

$ nomad job allocs example
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
c7bfb048  22e1813e  group       1        run      running   52s ago    21s ago
3958e815  22e1813e  group       1        stop     complete  1m36s ago  46s ago
65110380  22e1813e  group       0        run      running   2m5s ago   1m48s ago
976fdbf3  22e1813e  group       0        run      running   2m13s ago  1m57s ago
a422b9db  22e1813e  group       0        run      running   2m22s ago  2m6s ago
a7c3945c  22e1813e  group       0        run      running   2m32s ago  2m16s ago
6fbff862  22e1813e  group       0        run      running   2m40s ago  2m24s ago
16472afa  22e1813e  group       0        run      running   4m44s ago  4m27s ago

$ nomad deployment list
ID        Job ID   Job Version  Status      Description
04edbf0c  example  1            running     Deployment is running but requires manual promotion
02cf29ff  example  0            successful  Deployment completed successfully
```
After canceling that and promoting the deployment, I deployed again. This time I waited until after promotion. In the example below, the allocation `1b86450c` was in a stopped state, having been replaced by the deployment:
```
$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T10:03:37-04:00: Restarting 6 allocations
    2023-03-17T10:03:37-04:00: Rescheduling allocation "95f54f2c" for group "group"
    2023-03-17T10:03:37-04:00: Allocation "95f54f2c" replaced by "f254e087", waiting for "f254e087" to start running
    2023-03-17T10:03:43-04:00: Allocation "f254e087" is "running"
==> 2023-03-17T10:03:43-04:00: Proceed with the next batch? [Y/n/<wait duration>] 1s
==> 2023-03-17T10:03:59-04:00: Proceeding restarts with new wait time of 1s
==> 2023-03-17T10:03:59-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:00-04:00: Rescheduling allocation "57654725" for group "group"
    2023-03-17T10:04:00-04:00: Allocation "57654725" replaced by "7bce4be0", waiting for "7bce4be0" to start running
    2023-03-17T10:04:06-04:00: Allocation "7bce4be0" is "running"
==> 2023-03-17T10:04:06-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:07-04:00: Rescheduling allocation "1b86450c" for group "group"
    2023-03-17T10:05:07-04:00: Still waiting for allocation "1b86450c" to be replaced
    2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced
```
My take on this is that we probably shouldn't support running `job restart` during an active deployment at all, because it makes it impossible for the deployment watcher to reason about what's going on. We could block that at this line by checking for active deployments and returning an error.
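A minimal sketch of what that check could look like with the Go API client's `Jobs().LatestDeployment` endpoint (the placement and the error message are assumptions, not the code that eventually landed):

```go
// Refuse to restart while a deployment is actively running. Assumes
// c.client is an *api.Client; "running" mirrors the deployment status
// string used by Nomad.
deployment, _, err := c.client.Jobs().LatestDeployment(c.jobID, nil)
if err != nil {
	c.Ui.Error(fmt.Sprintf("Error retrieving deployments for job %q: %v", c.jobID, err))
	return 1
}
if deployment != nil && deployment.Status == "running" {
	c.Ui.Error(fmt.Sprintf(
		"Deployment %q is still running; wait for it to complete or cancel it before restarting the job",
		deployment.ID))
	return 1
}
```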
Nice catch! We do the same thing for the autoscaler because messing with allocations during deployments is a recipe for disaster 😅

I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart. I also didn't gate it on `-reschedule` because I think even an in-place restart can be dangerous.
While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in `monitorReplacementAlloc`. We need to keep following `NextAllocation` because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.
> I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart.

By "this line" I meant "here, right after `JobByPrefix`". But where you put it is much better, because that'll protect us from a time-of-check-time-of-use bug. 👍
> I also didn't gate it on `-reschedule` because I think even an in-place restart can be dangerous.

Good call.
> While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in `monitorReplacementAlloc`. We need to keep following `NextAllocation` because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.

Nice catch!
The `job restart` command monitors the allocation that was stopped and its follow-up until it is running. But in some scenarios it may be possible for the follow-up allocation itself to be replaced, such as when the new allocation fails. In this scenario the command needs to keep following the `NextAllocation` ID until it finds one that is running.
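A hedged sketch of the idea behind that fix; the function name is hypothetical, and the real implementation uses blocking queries rather than polling:

```go
import (
	"time"

	"github.com/hashicorp/nomad/api"
)

// followReplacements walks the NextAllocation chain until it finds a
// replacement that is running, re-resolving the chain whenever a
// replacement is itself replaced after failing.
func followReplacements(client *api.Client, allocID string) (*api.Allocation, error) {
	for {
		alloc, _, err := client.Allocations().Info(allocID, nil)
		if err != nil {
			return nil, err
		}
		switch {
		case alloc.ClientStatus == api.AllocClientStatusRunning:
			return alloc, nil
		case alloc.NextAllocation != "":
			// The allocation was replaced (for example, after failing),
			// so follow the chain to its successor.
			allocID = alloc.NextAllocation
		default:
			// No replacement yet; a real implementation would use a
			// blocking query with WaitIndex instead of sleeping.
			time.Sleep(time.Second)
		}
	}
}
```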
Implement the new `nomad job restart` command that allows operators to restart allocations' tasks or reschedule the entire allocation. Restarts can be batched to target multiple allocations in parallel. Between each batch the command can stop and hold for a predefined time or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected, so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour.

Co-authored-by: Luiz Aoqui <[email protected]>
Co-authored-by: Shishir Mahajan <[email protected]>
Closes #698
Closes #10391