Quantify the impact of and optimize the schedule loop's job iteration scheme #183
To avoid premature optimization, we will probably want to look at the performance impact of this on the conservative and hybrid backfill algorithms. In terms of peeling this onion the right way, it would be good to add the optimization for #182 first and then see the impact of this issue on conservative and hybrid after that.
As suggested in distribution #14, I recommend adding a setting called "queue-depth" to the sched comms module that will impose a limit on the number of jobs considered in each scheduling loop. This would apply to both the FCFS and backfill plugins.
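A minimal sketch of how such a queue-depth cap might look inside the scheduling loop (a sketch only: `schedule_job`, `struct sched_ctx`, and the use of czmq's `zlist` for the pending-job queue are my assumptions, not the actual flux-sched interfaces):

```c
#include <czmq.h>

/* Illustrative sketch, not the real flux-sched code: stop walking the
 * pending-job queue once queue_depth jobs have been considered. */
static int schedule_jobs (struct sched_ctx *ctx, zlist_t *pending)
{
    int considered = 0;
    int started = 0;
    void *job = zlist_first (pending);

    while (job && considered < ctx->queue_depth) {
        if (schedule_job (ctx, job) == 0)   /* assumed helper */
            started++;
        considered++;
        job = zlist_next (pending);
    }
    return started;
}
```

With `queue_depth` set to something like 100 (as in the quick test later in this thread), the per-loop cost stops growing with the total queue length.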
This is reasonable, but as you and I discussed (before I realized I had a meeting with Tapasya), my main question is how the queue depth should interact with the reservation depth, which we currently use for hybrid backfill scheduling. Should queue-depth take precedence over reservation depth? Then, if queue-depth < reservation depth in hybrid backfill, do we print an error or warning message saying this policy is not fully supported? Or would it be better to use an alternative parameter that tells the scheduler to consider x number of jobs after the reservation depth is satisfied? As you argue, the downside of this is that it may not work with FCFS, whose concept of reservation is implicit. But at least it won't change the policy. In any case, one thing we will probably want to do is introduce a max for the pure conservative backfill mode (and make it configurable too). We probably need to understand what that max should be under a stress load like distribution #14. The other reasonable optimization is not to schedule on each and every job/resource event.
OK. I did a quick test by just limiting the jobs requested by the schedule loop to 100, and this makes a pretty substantial difference (2x): Jobs Executed Per Second (JEPS) at <2 nodes, 1 broker per node, unit sizing policy, non-mpi sleep, FCFS>: 3.5.
In order to reason about the performance in a more comprehensive way, I am pretty convinced that I will need a performance model for our scheduler. I will spend some time coming up with one.
@dongahn, with regard to reservation and queue depths, any warning should come from the backfill plugin if the requested reservation depth exceeds the queue depth. Queue depth will take precedence over reservation depth. Given that reservation depth is only relevant to the backfill plugin, I am not in favor of creating a queue depth that is tied to reservation depth.
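A sketch of what that precedence-plus-warning could look like in the backfill plugin (hypothetical helper and names; only the `flux_log` call reflects a real flux-core API, the rest is assumed):

```c
#include <syslog.h>
#include <flux/core.h>

/* Illustrative sketch: queue-depth takes precedence, and the backfill
 * plugin warns if the requested reservation depth exceeds it. */
static void reconcile_depths (flux_t *h, int queue_depth, int *resv_depth)
{
    if (*resv_depth > queue_depth) {
        flux_log (h, LOG_WARNING,
                  "reservation-depth %d exceeds queue-depth %d; "
                  "clamping reservation-depth to %d",
                  *resv_depth, queue_depth, queue_depth);
        *resv_depth = queue_depth;
    }
}
```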
@lipari: OK. This makes sense to me. I will see what I can do. In terms of comparing our notes, here is my $.02 on the scheduling control parameters. I believe that it is not only the "scheduler plugins" but also the "scheduling control parameter set" that will represent the choices a user can make to specialize the scheduling policy on his/her flux instance. E.g., if a user wants HTC UQ scheduling, he may want to instantiate our EASY backfill scheduler with a short queue-depth on his flux instance. Users can write and instantiate their own custom scheduling plugins, but most users will probably want to use ready-made choices. So, one thing I would like to ensure is that the scheduling control parameters are not only configurable and manageable but also have clear semantics. Like you suggested, having compatibility checks is clearly one of the ways to convey this. If this space becomes large in the end, things can quickly get unwieldy, so what we may ultimately need is a catalogue that names some of the well-known choices (see the sketch after the next comment). Now, I have looked at SLURM's control parameters, and they seem not as intuitive as one would hope; this may be an area where we can do much better. From @surajpkn's recent email:
BTW, this is one of the reasons why I wanted to build a performance model of the scheduler: to reason about this space... Sorry for the long tangent.
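To make the catalogue idea concrete, here is a hypothetical sketch of named, ready-made policy choices mapping to a plugin plus a control-parameter set. The names, values, and struct layout are all illustrative assumptions, not an existing flux-sched API:

```c
/* Hypothetical catalogue of well-known policy choices.  Each entry
 * bundles a scheduler plugin with a control-parameter set so users
 * get clear semantics without tuning individual knobs. */
struct sched_policy {
    const char *name;          /* well-known policy name */
    const char *plugin;        /* scheduler plugin to load */
    int queue_depth;           /* jobs considered per scheduling loop */
    int reservation_depth;     /* backfill reservations (-1 = reserve all) */
};

static const struct sched_policy catalogue[] = {
    { "fcfs",         "sched.fcfs",     1024,  0 },
    { "easy",         "sched.backfill", 1024,  1 },
    { "htc-uq",       "sched.backfill",  100,  1 },  /* short queue-depth */
    { "conservative", "sched.backfill", 1024, -1 },
};
```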
I was wondering how this discussion could lead to a performance improvement for FCFS. Glancing again at the code, it seems that the FCFS scheduler loops over every job and schedules any jobs that can be scheduled at the current time. If I'm reading the code correctly, this actually isn't FCFS; this is backfilling without a reservation. (I think I am missing something though.) For FCFS, once the first job fails to be scheduled, we should break out of the `for` loop. For the backfilling scheduler, I agree with the existing discussion that the proposed queue-depth limit makes sense.
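For reference, a strict-FCFS loop would look roughly like this (a sketch only, under the homogeneous-resource assumption discussed in the next comment; `allocate_resources`, `struct sched_ctx`, and the `zlist` queue are placeholders):

```c
#include <czmq.h>

/* Illustrative strict FCFS: stop at the first job that cannot start,
 * so nothing ever starts out of submit order. */
static void schedule_fcfs (struct sched_ctx *ctx, zlist_t *pending)
{
    void *job = zlist_first (pending);
    while (job) {
        if (allocate_resources (ctx, job) < 0)   /* assumed helper */
            break;  /* head of queue blocked: strict FCFS stops here */
        job = zlist_next (pending);
    }
}
```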
@lipari can weigh in. If I understood him right yesterday, FCFS can do out-of-order scheduling if a high-priority job requests nodes with specific constraints (e.g., bigger memory, etc.).
If all of the resources being scheduled were identical, then the FCFS scheduler could be simplified to stop searching down the job queue once the first reservation had been made. The current FCFS scheduler supports heterogeneous resources: nodes with more memory, different numbers of sockets, etc. So using @dongahn's example, all the big-memory nodes could be reserved for the top-priority pending job, while small-memory nodes could be allocated to lower-priority jobs that will accept small-memory nodes.
First thoughts on the performance model. Perhaps we can model the overall scheduling overhead, T, as the combination of several overhead terms. Then, if KVS access with many front-loaded submissions causes us problems, the KVS term will dominate T. And if the sum of the jobs' computation is not much greater (>>) than T, then we are scheduling-bound for that workload... This kind of reasoning may help @SteVwonder on his "scheduler scalability" metric as well. I will think some more on this after a concall before doing some work for it.
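As a hedged illustration of the kind of model this could produce (the terms below are my assumptions, not measured quantities): with N the number of jobs considered per loop (bounded by queue-depth) and assumed per-job costs for resource matching and KVS access,

```latex
% Illustrative decomposition only; T_match and T_kvs are assumed per-job costs.
T \approx N \cdot \left( T_{\mathrm{match}} + T_{\mathrm{kvs}} \right)
```

Under such a model, capping N via queue-depth bounds T per scheduling loop regardless of total queue length, which would be consistent with the 2x JEPS improvement reported above.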
I see now. We have different definitions of FCFS. I have been operating under the assumption that FCFS means that jobs will be started in the exact order in which they are submitted. So if you produce two lists of the jobs, one sorted by the jobs' start times and the other sorted by the jobs' submit times, the two lists should match exactly. I think we are getting away from the original purpose of this issue, though. I think we should move the discussion over to #168, since for the validation we will need to rigorously define the behaviors that we expect from the different scheduling plugins with various parameters.
Close via #190
We may need an optimization for this code as captured in flux-distribution#14: specifically, this code. Once an optimization is done, please rerun tests.