
cli: job restart command #16278

Merged: lgfa29 merged 27 commits into main from f-job-restart-cmd on Mar 23, 2023
Conversation

lgfa29 (Contributor) commented Feb 28, 2023

Implement the new nomad job restart command that allows operators to restart the tasks of allocations in place or reschedule entire allocations.

Restarts can be batched to target multiple allocations in parallel. Between batches the command can pause for a predefined time or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected, so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour.

Closes #698
Closes #10391

shishir-a412ed and others added 2 commits February 28, 2023 19:06
Implement the new `nomad job restart` command that allows operators to
restart the tasks of allocations in place or reschedule entire
allocations.

Restarts can be batched to target multiple allocations in parallel.
Between batches the command can pause for a predefined time or until
the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected, so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

schmichael (Member) left a comment:

Sorry, WIP. Will return to it.

(Earlier review threads on command/job_restart.go and website/content/docs/commands/job/restart.mdx are resolved; most are marked outdated.)
lgfa29 added 3 commits March 1, 2023 20:15
The previous implementation would only monitor the eval resulting from
the alloc stop action, but when multiple allocations for the same task
group are stopped at the same time (which can happen with `-batch-size`
greater than 1) and the replacement allocations fail to be placed, only
one of the evals is blocked and includes the failed placement metrics.
The other eval is simply cancelled without any reference to the blocked
eval, causing the goroutine to be stuck waiting for a replacement alloc
and preventing the whole command from exiting.

The new implementation monitors all evals for the job until either a
blocked eval is found that contains failed placement metrics for the
alloc's task group or a replacement alloc succeeds.
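
As a rough illustration of the monitoring described in this commit message (not the PR's actual code), the sketch below polls a job's evaluations through the Nomad Go API until it finds a blocked eval with failed placement metrics for a given task group; the function name, polling interval, and timeout are assumptions.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForBlockedPlacement polls the job's evaluations until one of them is
// blocked and reports failed placements for the given task group. It is a
// simplified sketch of the monitoring described above; the real command also
// watches for a replacement allocation becoming healthy and stops then.
func waitForBlockedPlacement(client *api.Client, jobID, taskGroup string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		evals, _, err := client.Jobs().Evaluations(jobID, nil)
		if err != nil {
			return err
		}
		for _, eval := range evals {
			// A blocked eval with failed placement metrics for our task
			// group means the replacement allocation could not be placed.
			if eval.Status == "blocked" {
				if metrics, ok := eval.FailedTGAllocs[taskGroup]; ok && metrics != nil {
					return fmt.Errorf("placement for task group %q failed (blocked eval %s)", taskGroup, eval.ID)
				}
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for a placement result")
}
```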

lgfa29 (Contributor, Author) commented Mar 16, 2023

TestJobRestartCommand_shutdownDelay_reschedule is flaky. I will refactor it so it's less time-sensitive.

Fixed by adding a safety margin.


lgfa29 (Contributor, Author) commented Mar 16, 2023

@tgross @schmichael this PR is ready for another round of review. There were a few changes from the last round that, unfortunately, didn't fall neatly into separate commits, but these were the latest changes:

  • Batch error handling is now done using the -on-error flag. Since we now have up to two questions that we may ask users between batches (-batch-wait and -on-error), I refactored the logic to make sure we only ask them once.
  • I realized that I could wrap the Meta.Ui in a *cli.ConcurrentUi instead of having two UIs in the command, which was confusing (see the sketch after this list for the general idea). But this required some reflect magic to "unwrap" the UIs.
  • I added a few more output messages, especially when -reschedule is used, because the process can take a while (or not progress at all!). I then noticed that calling monitorPlacementFailures() per alloc is kind of redundant because evals are per job anyway, but I couldn't find a nice way to have a single monitor goroutine that didn't involve a major refactoring. In hindsight, the event stream would probably be better here.
  • Handle multi-region jobs by asking the user for confirmation that it's OK to restart the job in a single region. Coordinated multi-region job restart is left as future work.
  • Since there were so many questions to ask, I created askQuestion() to format questions and handle input in a more standardized way.
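
For reference, here is a minimal sketch of the UI wrapping and question handling mentioned above, assuming github.com/mitchellh/cli. askYesNo is a hypothetical stand-in for the PR's askQuestion() helper, not its actual signature.

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/mitchellh/cli"
)

func main() {
	// Wrap a base UI in cli.ConcurrentUi so goroutines printing progress
	// messages and the goroutine asking questions can share one UI safely.
	var ui cli.Ui = &cli.ConcurrentUi{
		Ui: &cli.BasicUi{
			Reader:      os.Stdin,
			Writer:      os.Stdout,
			ErrorWriter: os.Stderr,
		},
	}

	proceed, err := askYesNo(ui, "Proceed with the next batch? [y/N]")
	if err != nil {
		ui.Error(err.Error())
		return
	}
	fmt.Println("proceed:", proceed)
}

// askYesNo is a hypothetical helper in the spirit of the askQuestion()
// mentioned above: it asks a question, reads an answer, and keeps asking
// until it gets an answer it understands.
func askYesNo(ui cli.Ui, question string) (bool, error) {
	for {
		answer, err := ui.Ask(question)
		if err != nil {
			return false, err
		}
		switch strings.ToLower(strings.TrimSpace(answer)) {
		case "y", "yes":
			return true, nil
		case "", "n", "no":
			return false, nil
		default:
			ui.Output(fmt.Sprintf("Invalid answer %q", answer))
		}
	}
}
```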

Here's the diff of the new commits:
https://github.com/hashicorp/nomad/pull/16278/files/971ec5d4b55e01f719c0aa294c39b1f9f2066323..8cdcc1411716ec4a60ca881b4e2559898c8c2748

lgfa29 requested review from tgross and schmichael on March 16, 2023
// The flags -batch-wait and -on-error have an 'ask' option. This function
// handles both to prevent asking the user twice in case they are both set to
// 'ask' and an error happens.
func (c *JobRestartCommand) shouldProceed(err error) bool {

lgfa29 (Contributor, Author) commented:

I couldn't find a way to test this function since it requires user input and messing with os.Stdin during tests seemed dangerous 😬
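
For what it's worth, one general technique for exercising input-driven code without touching os.Stdin is to back the UI with an in-memory reader. This is only a sketch of that idea using cli.BasicUi, not the command's actual test setup.

```go
package command

import (
	"bytes"
	"strings"
	"testing"

	"github.com/mitchellh/cli"
)

func TestAskWithInjectedInput(t *testing.T) {
	// Feed a canned answer through the UI's Reader instead of os.Stdin.
	var out bytes.Buffer
	ui := &cli.BasicUi{
		Reader:      strings.NewReader("y\n"),
		Writer:      &out,
		ErrorWriter: &out,
	}

	answer, err := ui.Ask("Proceed? [y/N]")
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if answer != "y" {
		t.Fatalf("expected %q, got %q", "y", answer)
	}
}
```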


tgross (Member) left a comment:

LGTM! This whole feature is a masterclass in the difference between incidental complexity and intrinsic complexity. 😀

I've left a fairly long comment around what happens during deployments, but that's a one-liner to fix so I'm 👍 on shipping once that's handled.

}()

for {
answer, err := c.Ui.Ask(c.Colorize().Color(prefixedQuestion))

Member comment:

While doing some testing I noticed that Ask has some weird behavior around buffering where some errant keypresses after Ctrl-C got dropped into the input. I'm not sure if we can actually do anything about it here in Nomad, but just leaving a note in case we encounter it again:

    2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced
^C^[[A^[[A^[^[==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N] y
                               Invalid answer "\x1b[A\x1b[A\x1b\x1by"
==> 2023-03-17T10:06:13-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N] y


lgfa29 (Contributor, Author) replied:

Ah interesting! We could try to remove signal characters from the input Ask gives us, but the list of signals and their values may change from system to system.

It may also be possible to Ignore all other signals while an input is being provided, but that can be pretty annoying since it breaks the expected flow of signal handling.

All of this to say that I agree there's not much we can do here...

Comment on lines +265 to +270
// Use prefix matching to find job.
job, err := c.JobByPrefix(c.client, c.jobID, nil)
if err != nil {
c.Ui.Error(err.Error())
return 1
}

tgross (Member) commented:

During testing I discovered that passing -reschedule while there's an ongoing deployment pretty much always results in this command hanging. I used a job with a canary and manual promotion in its update block.

First attempt was running the restart while the deployment was waiting for manual promotion:

$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T09:55:39-04:00: Restarting 7 allocations
    2023-03-17T09:55:39-04:00: Rescheduling allocation "3958e815" for group "group"
^C==> 2023-03-17T09:57:36-04:00: Restart interrupted, no more allocations will be restarted.
                               Are you sure you want to stop the restart process? [y/N]
    2023-03-17T09:57:39-04:00: Still waiting for allocation "3958e815" to be replaced
    2023-03-17T09:58:39-04:00: Still waiting for allocation "3958e815" to be replaced

$ nomad job allocs example
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
c7bfb048  22e1813e  group       1        run      running   52s ago    21s ago
3958e815  22e1813e  group       1        stop     complete  1m36s ago  46s ago
65110380  22e1813e  group       0        run      running   2m5s ago   1m48s ago
976fdbf3  22e1813e  group       0        run      running   2m13s ago  1m57s ago
a422b9db  22e1813e  group       0        run      running   2m22s ago  2m6s ago
a7c3945c  22e1813e  group       0        run      running   2m32s ago  2m16s ago
6fbff862  22e1813e  group       0        run      running   2m40s ago  2m24s ago
16472afa  22e1813e  group       0        run      running   4m44s ago  4m27s ago

$ nomad deployment list
ID        Job ID   Job Version  Status      Description
04edbf0c  example  1            running     Deployment is running but requires manual promotion
02cf29ff  example  0            successful  Deployment completed successfully

After canceling that and promoting the deployment, I deployed again. This time I waited until after promotion. In the example below, the allocation 1b86450c was in a stopped state, having been replaced by the deployment:

$ nomad job restart -reschedule -batch-wait=ask example
==> 2023-03-17T10:03:37-04:00: Restarting 6 allocations
    2023-03-17T10:03:37-04:00: Rescheduling allocation "95f54f2c" for group "group"
    2023-03-17T10:03:37-04:00: Allocation "95f54f2c" replaced by "f254e087", waiting for "f254e087" to start running
    2023-03-17T10:03:43-04:00: Allocation "f254e087" is "running"
==> 2023-03-17T10:03:43-04:00: Proceed with the next batch? [Y/n/<wait duration>] 1s
==> 2023-03-17T10:03:59-04:00: Proceeding restarts with new wait time of 1s
==> 2023-03-17T10:03:59-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:00-04:00: Rescheduling allocation "57654725" for group "group"
    2023-03-17T10:04:00-04:00: Allocation "57654725" replaced by "7bce4be0", waiting for "7bce4be0" to start running
    2023-03-17T10:04:06-04:00: Allocation "7bce4be0" is "running"
==> 2023-03-17T10:04:06-04:00: Waiting 1s before restarting the next batch
    2023-03-17T10:04:07-04:00: Rescheduling allocation "1b86450c" for group "group"
    2023-03-17T10:05:07-04:00: Still waiting for allocation "1b86450c" to be replaced
    2023-03-17T10:06:07-04:00: Still waiting for allocation "1b86450c" to be replaced

My take on this is that we probably shouldn't support running job restart during an active deployment at all, because it makes it impossible for the deployment watcher to reason about what's going on. We could block that at this line by checking for active deployments and returning an error.


lgfa29 (Contributor, Author) replied:

Nice catch! We do the same thing for the autoscaler because messing with allocations during deployments is a recipe for disaster 😅

I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart. I also didn't gate it on -reschedule because I think even an in-place restart can be dangerous.

While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in monitorReplacementAlloc. We need to keep following NextAllocation because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.
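
For illustration, a minimal sketch of the kind of deployment guard discussed here (not the actual code in 069d2c6), using the Nomad Go API's latest-deployment query; the function name and the exact set of statuses treated as active are assumptions.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// checkNoActiveDeployment returns an error if the job has a deployment that
// is still in progress. A sketch of the guard discussed above; the command
// performs its check before each restart rather than only once up front.
func checkNoActiveDeployment(client *api.Client, jobID string) error {
	deployment, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return fmt.Errorf("failed to query deployments for job %q: %w", jobID, err)
	}
	if deployment == nil {
		// The job has never produced a deployment (e.g. no update block).
		return nil
	}
	// Treating "running", "pending", and "paused" as active is an assumption.
	switch deployment.Status {
	case "running", "pending", "paused":
		return fmt.Errorf("job %q has an active deployment %s (%s); refusing to restart while it is in progress",
			jobID, deployment.ID, deployment.Status)
	}
	return nil
}
```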


tgross (Member) replied:

I think there's a link missing in your comment, but I implemented the check in 069d2c6 before each restart.

By "this line" I meant "here, right after JobByPrefix. But where you put it is much better, because that'll protect us from a time-of-check-time-of-use bug. 👍

I also didn't gate it on -reschedule because I think even an in-place restart can be dangerous.

Good call

While working on this I realized that the core issue of the command getting blocked during a deployment was a bug in monitorReplacementAlloc. We need to keep following NextAllocation because, for example, if the first replacement fails Nomad will keep creating follow-up allocations. I fixed this in fde22d6.

Nice catch!

lgfa29 added 2 commits March 17, 2023 17:08
The `job restart` command monitors the allocation that was stopped and
its follow-up until it is running. But in some scenarios it may be
possible for the follow-up allocation itself to be replaced, such as
when the new allocation fails.

In this scenario the command needs to keep following the
`NextAllocation` ID until it finds one that is running.
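
A simplified sketch of the fix this commit describes: keep following the NextAllocation chain via the Nomad Go API until a replacement reaches the running state. The real monitorReplacementAlloc differs (for example, it uses blocking queries); the function name, polling, and timeout here are assumptions.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForRunningReplacement follows an allocation's NextAllocation chain
// until a replacement reaches the "running" client status, re-resolving the
// chain because a failed replacement may itself be replaced by a follow-up.
func waitForRunningReplacement(client *api.Client, allocID string, timeout time.Duration) (*api.Allocation, error) {
	deadline := time.Now().Add(timeout)
	current := allocID

	for time.Now().Before(deadline) {
		alloc, _, err := client.Allocations().Info(current, nil)
		if err != nil {
			return nil, err
		}

		switch {
		case alloc.ID != allocID && alloc.ClientStatus == api.AllocClientStatusRunning:
			// A replacement allocation is running; we are done.
			return alloc, nil
		case alloc.NextAllocation != "":
			// Keep following the chain: the replacement may itself have
			// failed and been replaced by a follow-up allocation.
			current = alloc.NextAllocation
		default:
			// No further replacement yet; poll again shortly.
			time.Sleep(time.Second)
		}
	}
	return nil, fmt.Errorf("timed out waiting for a running replacement of alloc %s", allocID)
}
```
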
lgfa29 added the backport/1.5.x (backport to 1.5.x release line) label on Mar 23, 2023
lgfa29 merged commit fffdbdf into main on Mar 23, 2023
lgfa29 deleted the f-job-restart-cmd branch on March 23, 2023 at 22:28
lgfa29 added a commit that referenced this pull request on Mar 23, 2023 (same commit message as above).

Co-authored-by: Shishir Mahajan <[email protected]>
lgfa29 added a commit that referenced this pull request on Mar 23, 2023 (same commit message as above).

Co-authored-by: Luiz Aoqui <[email protected]>
Co-authored-by: Shishir Mahajan <[email protected]>
Labels: backport/1.5.x (backport to 1.5.x release line), theme/cli

Successfully merging this pull request may close these issues:
  • [feature request] Force (fast) job restart
  • Add ability to restart all running tasks/allocs of a job

4 participants