Nomad CLI should allow multiple jobs in stop command #2390
Comments
I see the utility of this, but also question what the implementation should look like. Mainly, if one is stopping thousands of jobs, then it implies that there is some scripting involved (to get the job names in the first place). At that point, what practical difference is there between issuing My thoughts are that some sort of name or tag filtering would be the most appropriate way to implement it, but maybe I am overlooking something? |
I'll use the exact scenario I ran into. I had nearly 5k jobs that needed to be stopped, and it took a little over 2 hours in a for loop. If one could stop many jobs in a single go, without submitting a separate command for every job and waiting for each one to report a status, this could drop to minutes. |
@gehzumteufel Did you use |
@gehzumteufel Ah, I see. So not just a nomad stop, but a stop that works in parallel. I think @dadgar's idea is workable, but solutions that involve external scripting (since you might not want to move on until everything is down) can be failure prone. |
We are also finding that stopping a large number of jobs in sequence is very time consuming. We would like to be able to Also why do all hashicorp products only use single |
A Go-ism (from Plan 9 before that?) that we chose to keep, I'm afraid: https://golang.org/pkg/flag/ |
In case anybody is here looking for this basic functionality that Nomad should provide but doesn't, and you have likely made a mistake by choosing this stack, you can at least try this:
echo "Killing dispatch jobs... (This may take a while.)"
if [[ $(nomad status) != "No running jobs" ]]; then
  # Dispatch job IDs contain a "/", so grep keeps only those rows.
  for job in $(nomad status | awk '{print $1}' | grep /)
  do
    # Skip the header row for jobs.
    if [ "$job" != "ID" ]; then
      # -purge removes the job from the system; -detach returns without monitoring.
      nomad stop -purge -detach "$job" > /dev/null
    fi
  done
fi |
With use of GNU parallel you can speed this up significantly
Adjust the number of cores as you see fit |
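The snippet that accompanied this comment did not survive in the thread. As a minimal sketch of the same idea, and not the commenter's original, the following uses `xargs -P` (supported by GNU and BSD xargs) in place of GNU parallel. The function name `stop_jobs_parallel` and the default of 4 workers are assumptions made for illustration; `nomad stop -detach` is the invocation already used in the loop earlier in the thread.

```shell
# Hypothetical helper: read job IDs from stdin (one per line) and stop
# them with up to $1 concurrent `nomad stop` processes (default: 4).
stop_jobs_parallel() {
  local workers="${1:-4}"
  # -n 1: pass one job ID per nomad invocation.
  # -P:   run up to $workers invocations at the same time.
  # Note: with GNU xargs, empty input still runs the command once;
  # adding -r avoids that where it is supported.
  xargs -n 1 -P "$workers" nomad stop -detach
}

# Example usage (commented out so that sourcing this file stops nothing):
# nomad status | awk 'NR > 1 { print $1 }' | stop_jobs_parallel 8
```

The worker count plays the role of the "number of cores" mentioned above. Each worker still issues its own API call, so this only shortens wall-clock time and does not reduce the load on the servers.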
Or just kill all pending jobs. It's pretty bad for operators that this behaviour is not built-in. |
Heads up for those who want to try out Nomad in production: I just spent the last hour trying to kill 700 jobs that are causing the cluster to freeze. Still ongoing. |
@schmichael Looked at |
@danishprakash, I just checked with engineering and that sounds good! If you pick this up and want some guidance, please let us know. And feel free to open a WIP/draft PR too - doesn't have to be perfect before getting feedback. Thank you! |
tl;dr - +1 to @danishprakash
@danishprakash Sounds good to me! Concurrent vs sequential is an interesting question, but I think your choice - concurrent - is the right one. Sequential is easy enough to script already, and a concurrent stop operation could someday call a batch/atomic stop API, which in @BirkhoffLee's case could be a significant optimization! 1 command, API call, and Raft commit instead of 700 of each.
Concurrent implies we attempt to stop all jobs even if any of them encounter an error. That means in the case of a missing ACL token we'll be spewing 1 error per job listed, but I think that's ok. I think halting on the first error encountered would be far worse, as it would be difficult to know what got stopped successfully and what didn't.
So the design is:
|
Funny that I reported this years ago, with the same exact issue that @BirkhoffLee had, because I had 5000 to kill. Appreciate that I wasn't the only one. |
@schmichael thanks for the helpful summary. I've started looking into this, just trying to understand the relevant pieces right now before making any changes, I'll open a draft PR soon.
I think this stemmed from seeing how
Wait, does this mean out of the stop cmd context or did I miss something here? |
I meant that if a user runs:
...and jobDoesNotExist doesn't exist: Basically the same as using bash job control:
|
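The commands from this comment are not preserved in the thread. As a hedged illustration of the bash job control comparison, and of the semantics described earlier (attempt every stop, report each failure, do not halt on the first error), a shell sketch might look like the following. The function name `stop_all` and the error message wording are hypothetical; this is not the eventual CLI implementation.

```shell
# Hypothetical helper: stop every job ID given as an argument concurrently,
# keep going when one of them fails, and report the failures.
stop_all() {
  local pids="" failed=0 job pid entry
  for job in "$@"; do
    # Start each stop in the background and remember its PID and job ID.
    nomad stop -detach "$job" &
    pids="$pids $!:$job"
  done
  for entry in $pids; do
    pid="${entry%%:*}"   # text before the first colon
    job="${entry#*:}"    # text after the first colon
    # wait returns the exit status of that background process.
    if ! wait "$pid"; then
      echo "failed to stop job: $job" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every stop succeeded.
  [ "$failed" -eq 0 ]
}

# Example usage (commented out so that sourcing this file stops nothing):
# stop_all job1 jobDoesNotExist job2
```

Because every stop is started before any result is checked, one failing job does not prevent the others from being stopped, and the non-zero return value still signals that something went wrong.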
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Currently one is only able to feed a single job ID into the
nomad stop <job>
command. It would be great if this could accept many job IDs instead of just one. That way one could stop thousands of jobs at once if necessary, rather than needing a for loop or a similar one-execution-per-job approach.