Shutdown_delay not considered /w group defined services #6704

Closed
djenriquez opened this issue Nov 14, 2019 · 9 comments

djenriquez commented Nov 14, 2019

Nomad version

Output from nomad version

$ nomad -v
Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)

Operating system and Environment details

Amazon Linux 2

Issue

Allocations do not seem to respect shutdown_delay on shutdown. I believe this is because previously, when services were mapped to a task, there was a 1:N correlation that determined which Consul services to deregister before sending the kill signal. Now that the task has no service defined (since it's defined at the group level), I believe shutdown_delay is ignored completely.

We see this in our production environment: on an allocation shutdown, a kill signal is sent and the service terminates almost immediately, even though we have shutdown_delay set to 10s for the tasks within the group. This results in problematic 502s.
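
To make the expected ordering concrete, here is a minimal sketch of what shutdown_delay is meant to guarantee for one task: deregister services, wait, then send the signal. It is an illustration only and not Nomad's code; `stopTask`, `noSleep`, and `exampleStop` are hypothetical names, and the sleep function is injected so the example runs instantly.

```go
package main

import (
	"fmt"
	"time"
)

// stopTask is a hypothetical model of the intended shutdown ordering:
// deregister services first, wait out the delay, then send the kill signal.
// It returns the ordered list of steps it performed.
func stopTask(delay time.Duration, sleep func(time.Duration)) []string {
	events := []string{"deregister services"}
	if delay > 0 {
		sleep(delay) // in-flight requests can drain during this window
		events = append(events, fmt.Sprintf("wait %s", delay))
	}
	return append(events, "send kill signal")
}

// noSleep stands in for time.Sleep so the example does not block.
func noSleep(time.Duration) {}

// exampleStop uses the 10s delay mentioned in this report.
func exampleStop() []string {
	return stopTask(10*time.Second, noSleep)
}

func main() {
	for _, e := range exampleStop() {
		fmt.Println(e)
	}
}
```

The behavior reported here corresponds to the wait step being skipped, so the signal follows deregistration (or precedes it) with no drain window.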

Is this a known issue/regression from upgrading to network namespaces? Should there be a group-level shutdown_delay field introduced?

I see that shutdown_delay is included in the sidecar_task stanza. Should it have been included in the more generic group stanza?

@schmichael schmichael added this to the near-term milestone Nov 14, 2019
@schmichael
Member

Thanks for the bug report @djenriquez! Shutdown delay was only implemented for task services, but should apply when using group services as well. The 2 implementation options I can think of are:

  1. Each task respects its own shutdown_delay.
  2. New group level shutdown_delay.

@djenriquez
Author

@schmichael Thank you very much for the quick response! I'm not sure how difficult this work would be, and it sounds like it would require some struct changes, but would this be a quick one?

I think there is patience internally since our apps are mostly fault-tolerant, but if not, we may need to revert from network namespaces, as the 502s are not pretty to see in our highly dynamic environment.

@djenriquez
Author

djenriquez commented Nov 14, 2019

Also regarding option 1, I'm not sure how that would work since the services being registered would represent all tasks in the group. You'd probably have to introduce logic to use the greatest shutdown delay of all the tasks.
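
A rough sketch of that "greatest delay wins" idea, assuming a simplified `Task` type that is not Nomad's actual struct. `groupShutdownDelay` and `exampleTasks` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Task is a hypothetical stand-in for a task in a group; it is not Nomad's struct.
type Task struct {
	Name          string
	ShutdownDelay time.Duration // zero means the task sets no shutdown_delay
}

// groupShutdownDelay returns the largest shutdown delay among the tasks,
// or zero when no task sets one.
func groupShutdownDelay(tasks []Task) time.Duration {
	var longest time.Duration
	for _, t := range tasks {
		if t.ShutdownDelay > longest {
			longest = t.ShutdownDelay
		}
	}
	return longest
}

// exampleTasks returns an illustrative group of three tasks.
func exampleTasks() []Task {
	return []Task{
		{Name: "api", ShutdownDelay: 10 * time.Second},
		{Name: "proxy"},
		{Name: "log-shipper", ShutdownDelay: 30 * time.Second},
	}
}

func main() {
	fmt.Println(groupShutdownDelay(exampleTasks())) // prints "30s"
}
```

With this approach the group-level services would stay deregistered for the longest delay any task asks for, at the cost of holding every task for that long.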

@schmichael
Member

Option 1 wouldn't require any struct changes but is arguably the least user-friendly: when an allocation is killed, each task would wait its own shutdown_delay between deregistering services and sending the signal. So if you have 3 tasks in a group and only 1 sets shutdown_delay, the other 2 tasks would be killed immediately.
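
The three-task scenario can be sketched as follows. This is only a model of the per-task option under stated assumptions, not Nomad's implementation; the `task` type, `killedImmediately`, and `threeTaskGroup` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// task is a hypothetical model of a task in a group, not Nomad's own type.
type task struct {
	name          string
	shutdownDelay time.Duration // zero means no shutdown_delay is set
}

// killedImmediately lists, in input order, the tasks that would receive the
// kill signal with no grace period if each task honored only its own delay.
func killedImmediately(tasks []task) []string {
	var names []string
	for _, t := range tasks {
		if t.shutdownDelay == 0 {
			names = append(names, t.name)
		}
	}
	return names
}

// threeTaskGroup models a group where only one of three tasks sets a delay.
func threeTaskGroup() []task {
	return []task{
		{name: "api", shutdownDelay: 10 * time.Second},
		{name: "proxy"},
		{name: "log-shipper"},
	}
}

func main() {
	tasks := threeTaskGroup()
	for _, t := range tasks {
		fmt.Printf("%s: signal sent %s after deregistration\n", t.name, t.shutdownDelay)
	}
	fmt.Println(killedImmediately(tasks)) // prints "[proxy log-shipper]"
}
```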

@djenriquez
Author

Ah I see, shutdown signal would be handled differently for each task. That makes a lot of sense, thanks for clarifying.

@danlsgiga
Contributor

danlsgiga commented Nov 15, 2019

Another scenario for shutdown_delay is for sidecar jobs.

In my specific case, I have batch periodic jobs running every hour... they run really fast and generate some logs.

I have filebeat running in the same task group to send the logs to logstash, but I noticed that the leader task finishes before filebeat has had a chance to push the logs.

I have shutdown_delay = "30s" set in the filebeat task but that is not applied / respected when the leader finishes and the filebeat task is instructed to exit.

@drewbailey drewbailey self-assigned this Dec 2, 2019
@tgross tgross modified the milestones: near-term, unscheduled, 0.10.3 Jan 9, 2020
@tgross
Member

tgross commented Jan 9, 2020

@drewbailey should this issue have been closed by #6746?

@drewbailey
Contributor

Yes thanks, not sure why it didn't auto-close :(

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022