provide -no-shutdown-delay flag for job/alloc stop #11596

Merged Dec 13, 2021 (2 commits)

@tgross (Member) commented Dec 1, 2021

Fixes #11448

Some operators use very long group/task shutdown_delay settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to skip
that drain so they can quickly shed load.

Provide a -no-shutdown-delay flag on the nomad alloc stop and
nomad job stop commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.

Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads, because
the tasks will exit before their clients receive updated service
discovery information, so connections won't be gracefully drained.

@tgross tgross self-assigned this Dec 1, 2021
@tgross tgross marked this pull request as ready for review December 2, 2021 03:15
@tgross tgross marked this pull request as draft December 2, 2021 13:34
@tgross (Member Author) commented Dec 2, 2021

Moving this PR back into draft state because I think I figured out a way to work around the limitation I've documented here about stopping a task that's already waiting. Fixed.

@DerekStrickland (Contributor) left a comment

Nice job! No blockers. Couple of suggestions for a grammar fix in the doc text that is included in several places.

command/job_stop.go (outdated, resolved)
@@ -1600,7 +1600,7 @@ func (s *StateStore) upsertJobImpl(index uint64, job *structs.Job, keepVersion b
     }

     if err := s.updateJobCSIPlugins(index, job, existingJob, txn); err != nil {
-        return fmt.Errorf("unable to update job scaling policies: %v", err)
+        return fmt.Errorf("unable to update job csi plugins: %v", err)
Contributor:

Is this intentional for this change set?

Member Author:

Yeah, just fixing a nearby typo while I was here.

website/content/docs/commands/alloc/stop.mdx (outdated, resolved)
website/content/docs/commands/job/stop.mdx (outdated, resolved)
Co-authored-by: Derek Strickland <[email protected]>
github-actions bot commented Nov 7, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022
Successfully merging this pull request may close these issues.

provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout