
Nomad CLI should allow multiple jobs in stop command #2390

Closed
bengaywins opened this issue Mar 2, 2017 · 17 comments · Fixed by #12582
Labels
help-wanted We encourage community PRs for these issues! theme/cli type/enhancement

Comments

@bengaywins

Currently one can only feed a single job ID into the nomad stop <job> command. It would be great if it accepted many job IDs at once, so that one could stop thousands of jobs in a single invocation instead of needing a for loop or similar single execution per job.

@nugend

nugend commented Mar 3, 2017

I see the utility of this, but also question what the implementation should look like. Mainly, if one is stopping thousands of jobs, then it implies that there is some scripting involved (to get the job names in the first place). At that point, what practical difference is there between issuing nomad stop <job> thousands of times in a loop?

My thoughts are that some sort of name or tag filtering would be the most appropriate way to implement it, but maybe I am overlooking something?

@bengaywins
Author

bengaywins commented Mar 3, 2017

I'll use the exact scenario that happened to me. I had nearly 5k jobs that needed to be stopped, and it took a little over 2 hours in a for loop. If one could stop many jobs in a single go, without submitting a separate command for every job and waiting for each one to report status, this could drop down to minutes.
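
For a rough sense of scale, assuming roughly uniform per-job time: 5,000 jobs in about 2 hours works out to 7,200 s / 5,000 ≈ 1.4 s per sequential stop. With, say, 32 stops in flight at a time, the same workload would take roughly 7,200 / 32 ≈ 225 s, i.e. a few minutes.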

@dadgar
Contributor

dadgar commented Mar 3, 2017

@gehzumteufel Did you use nomad stop -detach <job-id>?

@nugend

nugend commented Mar 3, 2017

@gehzumteufel Ah, I see. So not just a nomad stop, but a stop that works in parallel.

I think @dadgar's idea is workable, but solutions that involve external scripting (since you might not want to move on until everything is down) can be failure prone.

@Miserlou

Miserlou commented Mar 15, 2018

We are also finding that stopping a large number of jobs in sequence is very time consuming. We would like to be able to nomad stop --all.

Also, why do all HashiCorp products only use a single - rather than -- for named arguments? So annoying.

@schmichael
Member

Also, why do all HashiCorp products only use a single - rather than -- for named arguments? So annoying.
-- @Miserlou

A Go-ism (from Plan 9 before that?) we chose to keep I'm afraid: https://golang.org/pkg/flag/
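
For illustration only, a minimal standalone stdlib example (not Nomad's actual CLI code): Go's flag package registers flags by bare name and renders them with a single dash in help output, although the parser itself accepts both -flag and --flag.

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flags are registered by bare name; help output shows them with a
	// single dash, though -purge and --purge both parse identically.
	purge := flag.Bool("purge", false, "purge the job after stopping it")
	detach := flag.Bool("detach", false, "return immediately instead of monitoring")
	flag.Parse()
	fmt.Println("purge:", *purge, "detach:", *detach)
}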

@Miserlou

Miserlou commented Aug 8, 2018

In case anybody is here looking for this basic functionality that Nomad should provide but doesn't, and you have likely made a mistake by choosing this stack, you can at least try this:

echo "Killing dispatch jobs... (This may take a while.)"
if [[ $(nomad status) != "No running jobs" ]]; then
    for job in $(nomad status | awk '{print $1}' | grep /)
    do
        # Skip the header row for jobs.
        if [ "$job" != "ID" ]; then
            nomad stop -purge -detach "$job" > /dev/null
        fi
    done
fi

@Fuco1
Contributor

Fuco1 commented Jun 8, 2019

With GNU parallel you can speed this up significantly:

nomad job status YOURJOB | grep pending | awk '{print $1}' > jobs
cat jobs | parallel -j32 nomad job stop -detach -purge 

Adjust the number of parallel jobs (-j) as you see fit.

@analytically

Or just kill all pending jobs. It's pretty bad for operators that this behaviour is not built-in.

@BirkhoffLee

In case anybody is here looking for this basic functionality that Nomad should provide but doesn't, and you have likely made a mistake by choosing this stack, you can at least try this:
-- @Miserlou

Heads up for those who want to try out Nomad in production: I just spent the last hour trying to kill 700 jobs that are causing the cluster to freeze. Still ongoing.

@Amier3 Amier3 added the help-wanted We encourage community PRs for these issues! label Apr 1, 2022
@danishprakash
Contributor

@schmichael Looked at Run() for JobStopCommand and it seems like we can accept multiple jobs and then concurrently stop them? I can work on this if that seems like the right direction.

@mikenomitch
Contributor

@danishprakash, I just checked with engineering and that sounds good!

If you pick this up and want some guidance, please let us know. And feel free to open a WIP/draft PR too - doesn't have to be perfect before getting feedback. Thank you!

@schmichael
Member

schmichael commented Apr 6, 2022

tl;dr - +1 to @danishprakash

@danishprakash Sounds good to me! Concurrent vs sequential is an interesting question, but I think your choice - concurrent - is the right one. Sequential is easy enough to script already, and a concurrent stop operation could someday call a batch/atomic stop API which in @BirkhoffLee's case could be a significant optimization! 1 command, API call, and Raft commit instead of 700 of each.

Concurrent implies we attempt to stop all jobs even if any of them encounter an error. That means in the case of a missing ACL token we'll be spewing 1 error per job listed, but I think that's ok. I think halting on the first error encountered would be far worse as it would be difficult to know what got stopped successfully and what didn't.

So the design is:

  1. Concurrent stops via goroutines in the CLI making independent HTTP requests
  2. Soft-fail on errors (log and allow other operations to continue)
  3. Future Work: batch/atomic stop support in the API/Raft.
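
A rough sketch of what that design could look like from the client side, using the Go API client (github.com/hashicorp/nomad/api). This is a hypothetical illustration, not the actual change that landed in #12582:

package main

import (
	"fmt"
	"os"
	"sync"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		fmt.Fprintln(os.Stderr, "error creating client:", err)
		os.Exit(1)
	}

	jobs := os.Args[1:] // job IDs to stop, e.g. "job1 job2 job3"

	var wg sync.WaitGroup
	errCh := make(chan error, len(jobs))

	for _, id := range jobs {
		wg.Add(1)
		go func(jobID string) {
			defer wg.Done()
			// Deregister (stop) the job without purging; each stop is an
			// independent HTTP request, so one failure doesn't halt the rest.
			if _, _, err := client.Jobs().Deregister(jobID, false, nil); err != nil {
				errCh <- fmt.Errorf("failed to stop %q: %w", jobID, err)
			}
		}(id)
	}

	wg.Wait()
	close(errCh)

	// Soft-fail: report every error, but exit non-zero if any stop failed.
	exitCode := 0
	for err := range errCh {
		fmt.Fprintln(os.Stderr, err)
		exitCode = 1
	}
	os.Exit(exitCode)
}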

@bengaywins
Author

Funny that I reported this years ago, with the same exact issue that @BirkhoffLee had, because I had 5000 to kill. Appreciate that I wasn't the only one.

@danishprakash
Contributor

@schmichael thanks for the helpful summary. I've started looking into this; just trying to understand the relevant pieces before making any changes. I'll open a draft PR soon.

Concurrent vs sequential is an interesting question, but I think your choice - concurrent - is the right one.

I think this stemmed from seeing how kubectl does this. You can pass multiple entities and the client fires off a request to the server and moves on to the next entity. Of course, it might not be a 1-1 implementation here but that felt pretty intuitive. Error handling in that context becomes quite different and equally important as you mentioned.

Concurrent implies we attempt to stop all jobs even if any of them encounter an error.

Wait, does this mean out of the stop cmd context or did I miss something here?

@schmichael
Member

@danishprakash

Concurrent implies we attempt to stop all jobs even if any of them encounter an error.

Wait, does this mean out of the stop cmd context or did I miss something here?

I meant that if a user runs:

nomad job stop job1 jobDoesNotExist job2

...and jobDoesNotExist doesn't exist: job1 and job2 should still get stopped successfully. We should display an error for jobDoesNotExist but still stop the other 2 jobs.

Basically the same as using bash job control:

$ nomad job stop foo &
[1] 3507196
$ nomad job stop doesNotExist & # <-- this will end up exiting with an error
[2] 3507206
$ nomad job stop bar &
[3] 3507210
$ wait
[1]   Done                    nomad job stop foo
[2]-  Exit 1                  nomad job stop doesNotExist
[3]+  Done                    nomad job stop bar

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 16, 2023