Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable delay between deregistering service and killing task #2441

Closed
bsphere opened this issue Mar 13, 2017 · 14 comments
Closed

Configurable delay between deregistering service and killing task #2441

bsphere opened this issue Mar 13, 2017 · 14 comments

Comments

@bsphere
Copy link

bsphere commented Mar 13, 2017

I believe that for true rolling updates of jobs, the updated alloc service endpoints should be removed from Consul first, wait for a grace period for active connections to drain and then restart..

@dadgar
Copy link
Contributor

dadgar commented Mar 13, 2017

@bsphere Nomad can't decide what that grace period is as it varies per job. The correct way to handle this is Nomad sends a signal that the application is being shutdown. The application should then fail its health check which will make consul not route traffic to that instance while it starts draining connections/work and then it should exit.

The service exists and thus is registered in Consul, the only thing changing is its status which is reflected by checks.

@bsphere
Copy link
Author

bsphere commented Mar 13, 2017 via email

@jemc
Copy link

jemc commented May 4, 2017

I think this feature is really important, and that counting on the application to handle it is not always feasible solution.

The first part of this solution (deregistering the consul service first, before initiating the kill sequence) was achieved in #2596.

I believe the next important step is to introduce a delay between the service deregistration and the kill, configurable as part of the Nomad job spec, with the intent of giving other services in a distributed system (like a load balancer) ample time to stop interacting with the service before it is killed.

Please see relevant discussion in #2607 and #2596. I think I've made some important arguments there that haven't been raised here in this ticket yet.

@ygersie
Copy link
Contributor

ygersie commented Jul 27, 2017

Agree with everything @jemc posted. Ideally Nomad would put the related Consul service into maintenance mode with a configurable timeout (default of 1 second would already be enough in most cases) before initiating de-registration and SIGTERM. This is especially troublesome right now in combination with github.com/eBay/fabio. It takes a couple of 10's of ms before Fabio removes the route which leads to client side 503's. This is fairly problematic and I don't see a real nice solution for it except for introducing extra logic in all of our services.

This seems like a fairly trivial thing for Nomad to provide as opposed to the amount of development required to get every service handle a SIGTERM by first failing the health check, waiting and then shutting down.

@ygersie
Copy link
Contributor

ygersie commented Jul 27, 2017

Also consider the fact that not every service we run with Nomad is under our control (Nginx would be 1 of them).

@ygersie
Copy link
Contributor

ygersie commented Jul 27, 2017

I think the title of this issue should be renamed to Graceful shutdown or something, as this applies to all variations of stopping allocations (drain, stop job, deploy).

@mlehner616
Copy link

mlehner616 commented Jul 27, 2017 via email

@dadgar dadgar changed the title Feature request: remove endpoint from Consul and wait for connection draining during "rolling updates" Configurable delay between deregistering service and killing task Jul 28, 2017
@skyrocknroll
Copy link

This is really important for us. Right now we ignore the softkill signal so consul service gets De registered and a delay with kill_timeout. After that container is brutally killed. Providing a delay config would help us handling everything Gracefully @dadgar

@schmichael
Copy link
Member

schmichael commented Aug 16, 2017

Proposal:

job "docs" {
  group "example" {
    task "server" {
      # ...
      
      # Delay between deregister and kill signal
      shutdown_delay = "5s"
    }
  }
}

Where shutdown_delay is the duration between deregistering services from Consul and sending the task the shutdown signal.

Defaults to 0 for backward compat.

schmichael added a commit that referenced this issue Aug 17, 2017
Fixes #2441

Defaults to 0 (no delay) for backward compat and because this feature
should be opt-in.
@skyrocknroll
Copy link

@schmichael This is just insanely awesome. Thanks ❤️ 💯

@schmichael
Copy link
Member

Thanks for the input everyone! 0.6.1 should be coming out soon with this feature.

@mlehner616
Copy link

@schmichael thank you for the attention on this, this will help with draining services a ton!

@ygersie
Copy link
Contributor

ygersie commented Aug 19, 2017

Thanks @smichael very helpful!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants