Configurable delay between deregistering service and killing task #2441
I believe that for true rolling updates of jobs, the updated alloc's service endpoints should be removed from Consul first, then a grace period should be allowed for active connections to drain, and only then should the task be restarted.
Comments
@bsphere Nomad can't decide what that grace period is, as it varies per job. The correct way to handle this is for Nomad to send a signal that the application is being shut down. The application should then fail its health check, which will make Consul stop routing traffic to that instance while it drains connections/work, and then it should exit. The service exists and is thus still registered in Consul; the only thing changing is its status, which is reflected by its checks. |
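To illustrate the pattern described in the comment above, here is a minimal Go sketch (not from this thread): an HTTP service whose Consul check polls a hypothetical /health endpoint starts failing that check on SIGTERM, waits for routing to converge, then drains in-flight requests and exits. The port, paths, and durations are all assumptions.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// Assumed Consul HTTP check target: once shutdown starts we return 503
	// so Consul marks the instance critical and routers stop sending traffic.
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	// Wait for the signal Nomad sends on task shutdown (SIGTERM by default).
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, os.Interrupt)
	<-sig

	// Start failing the health check, then give Consul and any load
	// balancers time to notice before draining.
	shuttingDown.Store(true)
	time.Sleep(10 * time.Second) // assumed check interval plus propagation slack

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx) // drains active connections, then returns
}
```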
Seems like a possible solution, but it requires support from the task side.
What about having the grace period in the job settings? That way, "legacy" code would still be supported.
|
I think this feature is really important, and counting on the application to handle it is not always a feasible solution. The first part of this solution (deregistering the Consul service before initiating the kill sequence) was achieved in #2596. I believe the next important step is to introduce a delay between the service deregistration and the kill, configurable as part of the Nomad job spec, with the intent of giving other services in a distributed system (like a load balancer) ample time to stop interacting with the service before it is killed. Please see the relevant discussion in #2607 and #2596; I think I've made some important arguments there that haven't been raised in this ticket yet. |
Agree with everything @jemc posted. Ideally, Nomad would put the related Consul service into maintenance mode with a configurable timeout (a default of 1 second would already be enough in most cases) before initiating deregistration and SIGTERM. This is especially troublesome right now in combination with github.com/eBay/fabio: it takes a few tens of milliseconds before Fabio removes the route, which leads to client-side 503s. This is fairly problematic, and I don't see a nice solution for it short of introducing extra logic in all of our services. This seems like a fairly trivial thing for Nomad to provide, as opposed to the amount of development required to make every service handle a SIGTERM by first failing its health check, waiting, and then shutting down. |
Also consider the fact that not every service we run with Nomad is under our control (Nginx being one of them). |
I think this issue should be renamed to "Graceful shutdown" or something similar, as this applies to all variations of stopping allocations (drain, stop job, deploy). |
@dropje86 Thank you for posting this; I actually have a half-written issue that I was about to post today for exactly the same thing. This also particularly affects Consul integration with regard to templates and the change_signal. The other use case is on deploy as well. It seems like Nomad should have all the information it needs to trigger a `consul maint` or deregister and THEN kill/signal the alloc. This is going to be a big problem, as we can't have client connections simply dropped while we run deploys or change Consul values.
For deploys there is a fairly straightforward workaround of triggering a consul maint during the process, but the case where we'd really need Nomad to do it is during a Consul KV update.
|
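The `consul maint` workaround mentioned above can also be driven from code rather than the CLI. Below is a minimal sketch using the official github.com/hashicorp/consul/api client; the service ID "web-1234", the reason string, and the 5-second grace period are assumptions, not values from this thread.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Marks every check for this service as critical, so routing layers
	// (fabio, consul-template, DNS) drop the instance.
	if err := client.Agent().EnableServiceMaintenance("web-1234", "deploying"); err != nil {
		log.Fatal(err)
	}

	// Assumed grace period for routers to converge and connections to drain.
	time.Sleep(5 * time.Second)

	// ...stop or signal the task here, then clear maintenance mode if the
	// service comes back under the same ID.
	if err := client.Agent().DisableServiceMaintenance("web-1234"); err != nil {
		log.Fatal(err)
	}
}
```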
This is really important for us. Right now we ignore the soft-kill signal, so the Consul service gets deregistered and we get a delay via kill_timeout; after that, the container is brutally killed. Providing a delay config would help us handle everything gracefully. @dadgar |
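For reference, the workaround described in that comment amounts to roughly the following hypothetical Go service. It relies on Nomad deregistering the Consul service when the kill sequence starts (per #2596) and force-killing the task once kill_timeout expires.

```go
package main

import (
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Ignore the soft-kill signal. Nomad has already deregistered the
	// Consul service at this point, so kill_timeout becomes an implicit
	// drain window; when it expires, Nomad sends SIGKILL.
	signal.Ignore(syscall.SIGTERM)

	for {
		time.Sleep(time.Second) // keep serving traffic until SIGKILL
	}
}
```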
Proposal:

```hcl
job "docs" {
  group "example" {
    task "server" {
      # ...

      # Delay between deregister and kill signal
      shutdown_delay = "5s"
    }
  }
}
```

Where `shutdown_delay` defaults to 0 (no delay). |
Fixes #2441. Defaults to 0 (no delay) for backward compatibility and because this feature should be opt-in. |
@schmichael This is just insanely awesome. Thanks ❤️ 💯 |
Thanks for the input everyone! 0.6.1 should be coming out soon with this feature. |
@schmichael thank you for the attention on this, this will help with draining services a ton! |
Thanks @schmichael, very helpful! |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |