
cli: job restart command #16278

Merged: lgfa29 merged 27 commits into main from f-job-restart-cmd on Mar 23, 2023
Conversation

lgfa29 (Contributor) commented Feb 28, 2023

Implement the new nomad job restart command that allows operators to restart the tasks of allocations in place or reschedule entire allocations.

Restarts can be batched to target multiple allocations in parallel. Between batches the command can pause for a predefined time or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected, so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour.

Closes #698
Closes #10391

shishir-a412ed and others added 2 commits February 28, 2023 19:06
Implement the new `nomad job restart` command that allows operators to
restart the tasks of allocations in place or reschedule entire
allocations.

Restarts can be batched to target multiple allocations in parallel.
Between batches the command can pause for a predefined time or until
the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected, so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

schmichael (Member) left a comment:

Sorry, WIP. Will return to it.

(Earlier review threads on command/job_restart.go and website/content/docs/commands/job/restart.mdx are resolved; most are marked outdated.)
lgfa29 added 3 commits March 1, 2023 20:15
The previous implementation would only monitor the eval resulting from
the alloc stop action, but when multiple allocations for the same task
group are stopped at the same time (which can happen with `-batch-size`
greater than 1) and the replacement allocations fail to be placed, only
one of the evals is blocked and includes the failed placement metrics.
The other eval is simply cancelled without any reference to the blocked
eval, causing the goroutine to be stuck waiting for a replacement alloc
and preventing the whole command from exiting.

The new implementation monitors all evals for the job until either a
blocked eval is found that contains failed placement metrics for the
alloc's task group or a replacement alloc succeeds.
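
As a rough illustration of the monitoring described in this commit message (not the PR's actual code), the sketch below polls a job's evaluations through the Nomad Go API until it finds a blocked eval with failed placement metrics for a given task group; the function name, polling interval, and timeout are assumptions.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForBlockedPlacement polls the job's evaluations until one of them is
// blocked and reports failed placements for the given task group. It is a
// simplified sketch of the monitoring described above; the real command also
// watches for a replacement allocation becoming healthy and stops then.
func waitForBlockedPlacement(client *api.Client, jobID, taskGroup string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		evals, _, err := client.Jobs().Evaluations(jobID, nil)
		if err != nil {
			return err
		}
		for _, eval := range evals {
			// A blocked eval with failed placement metrics for our task
			// group means the replacement allocation could not be placed.
			if eval.Status == "blocked" {
				if metrics, ok := eval.FailedTGAllocs[taskGroup]; ok && metrics != nil {
					return fmt.Errorf("placement for task group %q failed (blocked eval %s)", taskGroup, eval.ID)
				}
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for a placement result")
}
```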

lgfa29 (Contributor, Author) commented Mar 16, 2023

TestJobRestartCommand_shutdownDelay_reschedule is flaky. I will refactor it so it's less time-sensitive.

Fixed by adding a safety margin.


lgfa29 (Contributor, Author) commented Mar 16, 2023

@tgross @schmichael this PR is ready for another round of review. There were a few changes from the last round that, unfortunately, didn't fall neatly into separate commits, but these were the latest changes:

  • Batch error handling is now done using the -on-error flag. Since we now have up to two questions that we may ask users between batches (-batch-wait and -on-error), I refactored the logic to make sure we only ask them once.
  • I realized that I could wrap the Meta.Ui in a *cli.ConcurrentUi instead of having two UIs in the command, which was confusing (see the sketch after this list for the general idea). But this required some reflect magic to "unwrap" the UIs.
  • I added a few more output messages, especially when -reschedule is used, because the process can take a while (or not progress at all!). I then noticed that calling monitorPlacementFailures() per alloc is kind of redundant because evals are per job anyway, but I couldn't find a nice way to have a single monitor goroutine that didn't involve a major refactoring. In hindsight, the event stream would probably be better here.
  • Handle multi-region jobs by asking the user for confirmation that it's OK to restart the job in a single region. Coordinated multi-region job restart is left as future work.
  • Since there were so many questions to ask, I created askQuestion() to format questions and handle input in a more standardized way.
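
For reference, here is a minimal sketch of the UI wrapping and question handling mentioned above, assuming github.com/mitchellh/cli. askYesNo is a hypothetical stand-in for the PR's askQuestion() helper, not its actual signature.

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/mitchellh/cli"
)

func main() {
	// Wrap a base UI in cli.ConcurrentUi so goroutines printing progress
	// messages and the goroutine asking questions can share one UI safely.
	var ui cli.Ui = &cli.ConcurrentUi{
		Ui: &cli.BasicUi{
			Reader:      os.Stdin,
			Writer:      os.Stdout,
			ErrorWriter: os.Stderr,
		},
	}

	proceed, err := askYesNo(ui, "Proceed with the next batch? [y/N]")
	if err != nil {
		ui.Error(err.Error())
		return
	}
	fmt.Println("proceed:", proceed)
}

// askYesNo is a hypothetical helper in the spirit of the askQuestion()
// mentioned above: it asks a question, reads an answer, and keeps asking
// until it gets an answer it understands.
func askYesNo(ui cli.Ui, question string) (bool, error) {
	for {
		answer, err := ui.Ask(question)
		if err != nil {
			return false, err
		}
		switch strings.ToLower(strings.TrimSpace(answer)) {
		case "y", "yes":
			return true, nil
		case "", "n", "no":
			return false, nil
		default:
			ui.Output(fmt.Sprintf("Invalid answer %q", answer))
		}
	}
}
```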

Here's the diff of the new commits:
https://github.com/hashicorp/nomad/pull/16278/files/971ec5d4b55e01f719c0aa294c39b1f9f2066323..8cdcc1411716ec4a60ca881b4e2559898c8c2748

lgfa29 requested review from tgross and schmichael on March 16, 2023
// The flags -batch-wait and -on-error have an 'ask' option. This function
// handles both to prevent asking the user twice in case they are both set to
// 'ask' and an error happens.
func (c *JobRestartCommand) shouldProceed(err error) bool {

lgfa29 (Contributor, Author) commented:

I couldn't find a way to test this function since it requires user input and messing with os.Stdin during tests seemed dangerous 😬
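
For what it's worth, one general technique for exercising input-driven code without touching os.Stdin is to back the UI with an in-memory reader. This is only a sketch of that idea using cli.BasicUi, not the command's actual test setup.

```go
package command

import (
	"bytes"
	"strings"
	"testing"

	"github.com/mitchellh/cli"
)

func TestAskWithInjectedInput(t *testing.T) {
	// Feed a canned answer through the UI's Reader instead of os.Stdin.
	var out bytes.Buffer
	ui := &cli.BasicUi{
		Reader:      strings.NewReader("y\n"),
		Writer:      &out,
		ErrorWriter: &out,
	}

	answer, err := ui.Ask("Proceed? [y/N]")
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if answer != "y" {
		t.Fatalf("expected %q, got %q", "y", answer)
	}
}
```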


tgross (Member) left a comment:

LGTM! This whole feature is a masterclass in the difference between incidental complexity and intrinsic complexity. 😀

I've left a fairly long comment around what happens during deployments, but that's a one-liner to fix so I'm 👍 on shipping once that's handled.

}()

for {
answer, err := c.Ui.Ask(c.Colorize().Color(prefixedQuestion))

Member comment:

While doing some testing I noticed that Ask has some weird behavior around buffering where some errant keypresses after Ctrl-C got dropped into the input. I'm not sure if we can actually do anything about it here in Nomad, but just leaving a note in case we encounter it again:

    2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced
^C^[[A^[[A^[^[==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N] y
                               Invalid answer "\x1b[A\x1b[A\x1b\x1by"
==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N] y


lgfa29 (Contributor, Author) replied:

Ah interesting! We could try to remove signal characters from the input Ask gives us, but the list of signals and their values may change from system to system.

It may also be possible to Ignore all other signals while an input is being provided, but that can be pretty annoying since it breaks the expected flow of signal handling.

All of this to say that I agree there's not much we can do here...

Comment on lines +265 to +270
// Use prefix matching to find job.
job, err := c.JobByPrefix(c.client, c.jobID, nil)
if err != nil {
c.Ui.Error(err.Error())
return 1
}

tgross (Member) commented:

During testing I discovered that passing -reschedule while there's an ongoing deployment pretty much always results in this command hanging. I used a job with a canary and manual promotion in its update block.

First attempt was running the restart while the deployment was waiting for manual promotion:

$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T09:55:39-04:00: Restarting 7 allocations
    2023-03-17T09:55:39-04:00: Rescheduling allocation "3958e815" for group "group"
^C==> 2023-03-17T09:57:36-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N]
    2023-03-17T09:57:39-04:00: Still waiting for allocation "3958e815" to be replaced
    2023-03-17T09:58:39-04:00: Still waiting for allocation "3958e815" to be replaced

$ nomad job allocs example
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
c7bfb048  22e1813e  group       1        run      running   52s ago    21s ago
3958e815  22e1813e  group       1        stop     complete  1m36s ago  46s ago
65110380  22e1813e  group       0        run      running   2m5s ago   1m48s ago
976fdbf3  22e1813e  group       0        run      running   2m13s ago  1m57s ago
a422b9db  22e1813e  group       0        run      running   2m22s ago  2m6s ago
a7c3945c  22e1813e  group       0        run      running   2m32s ago  2m16s ago
6fbff862  22e1813e  group       0        run      running   2m40s ago  2m24s ago
16472afa  22e1813e  group       0        run      running   4m44s ago  4m27s ago

$ nomad deployment list
ID        Job ID   Job Version  Status      Description
04edbf0c  example  1            running     Deployment is running but requires manual promotion
02cf29ff  example  0            successful  Deployment completed successfully

After canceling that and promoting the deployment, I deployed again. This time I waited until after promotion. In the example below, the allocation 1b86450c was in a stopped state, having been replaced by the deployment:

$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T10:03:37-04:00: Restarting 6 allocations
    2023-03-17T10:03:37-04:00: Rescheduling allocation "95f54f2c" for group "group"
    2023-03-17T10:03:37-04:00: Allocation "95f54f2c" replaced by "f254e087", waiting for "f254e087" to start running
    2023-03-17T10:03:43-04:00: Allocation "f254e087" is "running"
==> 2023-03-17T10:03:43-04:00: Proceed with the next batch? [Y/n/<wait duration>] 1s
==> 2023-03-17T10:03:59-04:00: Proceeding restarts with new wait time of 1s
==> 2023-03-17T10:03:59-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:00-04:00: Rescheduling allocation "57654725" for group "group"
    2023-03-17T10:04:00-04:00: Allocation "57654725" replaced by "7bce4be0", waiting for "7bce4be0" to start running
    2023-03-17T10:04:06-04:00: Allocation "7bce4be0" is "running"
==> 2023-03-17T10:04:06-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:07-04:00: Rescheduling allocation "1b86450c" for group "group"
    2023-03-17T10:05:07-04:00: Still waiting for allocation "1b86450c" to be replaced
    2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced

My take on this is that we probably shouldn't support running job restart during an active deployment at all, because it makes it impossible for the deployment watcher to reason about what's going on. We could block that at this line by checking for active deployments and returning an error.


lgfa29 (Contributor, Author) replied:

Nice catch! We do the same thing for the autoscaler because messing with allocations during deployments is a recipe for disaster 😅

I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart. I also didn't gate it on -reschedule because I think even an in-place restart can be dangerous.

While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in monitorReplacementAlloc. We need to keep following NextAllocation because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.
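
For illustration, a minimal sketch of the kind of deployment guard discussed here (not the actual code in 069d2c6), using the Nomad Go API's latest-deployment query; the function name and the exact set of statuses treated as active are assumptions.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// checkNoActiveDeployment returns an error if the job has a deployment that
// is still in progress. A sketch of the guard discussed above; the command
// performs its check before each restart rather than only once up front.
func checkNoActiveDeployment(client *api.Client, jobID string) error {
	deployment, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return fmt.Errorf("failed to query deployments for job %q: %w", jobID, err)
	}
	if deployment == nil {
		// The job has never produced a deployment (e.g. no update block).
		return nil
	}
	// Treating "running", "pending", and "paused" as active is an assumption.
	switch deployment.Status {
	case "running", "pending", "paused":
		return fmt.Errorf("job %q has an active deployment %s (%s); refusing to restart while it is in progress",
			jobID, deployment.ID, deployment.Status)
	}
	return nil
}
```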


tgross (Member) replied:

I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart.

By "this line" I meant "here, right after JobByPrefix. But where you put it is much better, because that'll protect us from a time-of-check-time-of-use bug. 👍

I also didn't gate it on -reschedule because I think even an in-place restart can be dangerous.

Good call

While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in monitorReplacementAlloc. We need to keep following NextAllocation because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.

Nice catch!

lgfa29 added 2 commits March 17, 2023 17:08
The `job restart` command monitors the allocation that was stopped and
its follow-up until it is running. But in some scenarios it may be
possible for the follow-up allocation itself to be replaced, such as
when the new allocation fails.

In this scenario the command needs to keep following the
`NextAllocation` ID until it finds one that is running.
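
A simplified sketch of the fix this commit describes: keep following the NextAllocation chain via the Nomad Go API until a replacement reaches the running state. The real monitorReplacementAlloc differs (for example, it uses blocking queries); the function name, polling, and timeout here are assumptions.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForRunningReplacement follows an allocation's NextAllocation chain
// until a replacement reaches the "running" client status, re-resolving the
// chain because a failed replacement may itself be replaced by a follow-up.
func waitForRunningReplacement(client *api.Client, allocID string, timeout time.Duration) (*api.Allocation, error) {
	deadline := time.Now().Add(timeout)
	current := allocID

	for time.Now().Before(deadline) {
		alloc, _, err := client.Allocations().Info(current, nil)
		if err != nil {
			return nil, err
		}

		switch {
		case alloc.ID != allocID && alloc.ClientStatus == api.AllocClientStatusRunning:
			// A replacement allocation is running; we are done.
			return alloc, nil
		case alloc.NextAllocation != "":
			// Keep following the chain: the replacement may itself have
			// failed and been replaced by a follow-up allocation.
			current = alloc.NextAllocation
		default:
			// No further replacement yet; poll again shortly.
			time.Sleep(time.Second)
		}
	}
	return nil, fmt.Errorf("timed out waiting for a running replacement of alloc %s", allocID)
}
```
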
lgfa29 added the backport/1.5.x (backport to 1.5.x release line) label on Mar 23, 2023
lgfa29 merged commit fffdbdf into main on Mar 23, 2023
lgfa29 deleted the f-job-restart-cmd branch on March 23, 2023 at 22:28
lgfa29 added a commit that referenced this pull request on Mar 23, 2023 (same commit message as above).

Co-authored-by: Shishir Mahajan <[email protected]>
lgfa29 added a commit that referenced this pull request on Mar 23, 2023 (same commit message as above).

Co-authored-by: Luiz Aoqui <[email protected]>
Co-authored-by: Shishir Mahajan <[email protected]>
Labels: backport/1.5.x (backport to 1.5.x release line), theme/cli

Successfully merging this pull request may close these issues:
  • [feature request] Force (fast) job restart
  • Add ability to restart all running tasks/allocs of a job

4 participants