
Quantify the impact of and optimize the schedule loop's job iteration scheme #183

Closed
dongahn opened this issue Aug 22, 2016 · 13 comments

@dongahn
Member

dongahn commented Aug 22, 2016

We may need an optimization for this code (specifically, the schedule loop's job iteration scheme), as captured in flux-distribution#14. Once an optimization is done, please rerun the tests.

@dongahn changed the title from "Qualify the impact of and optimize the schedule loop's job iteration scheme" to "Quantify the impact of and optimize the schedule loop's job iteration scheme" on Aug 22, 2016
@dongahn
Member Author

dongahn commented Aug 22, 2016

To avoid premature optimization, we will probably want to look at the performance impact of this on the conservative and hybrid backfill algorithms. In terms of peeling this onion in the right way, it would be good to add the optimization for #182 first and then see the impact of this issue on conservative and hybrid backfill.

@lipari
Contributor

lipari commented Aug 22, 2016

As suggested in distribution #14, I recommend adding to the sched comms module a setting called "queue-depth" that imposes a limit on the number of jobs considered in each scheduling loop. This would apply to both the FCFS and backfill plugins.
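A minimal sketch of how such a limit might be applied in the scheduling loop; the job_t type, its "next" link, and try_schedule() are illustrative stand-ins, not the actual sched module code:

```c
/* Illustrative sketch only: job_t, its "next" link, and try_schedule()
 * are stand-ins, not the real flux-sched data structures or API. */
#include <stdbool.h>
#include <stdio.h>

typedef struct job {
    int id;
    bool pending;
    struct job *next;
} job_t;

/* Stand-in for the real per-job scheduling attempt. */
static bool try_schedule (job_t *job)
{
    printf ("considering job %d\n", job->id);
    return true;
}

/* Consider at most queue_depth pending jobs per scheduling loop;
 * queue_depth == 0 means "no limit". */
static int schedule_jobs (job_t *head, unsigned queue_depth)
{
    unsigned considered = 0;
    int scheduled = 0;
    for (job_t *j = head; j; j = j->next) {
        if (!j->pending)
            continue;
        if (queue_depth && considered++ >= queue_depth)
            break;
        if (try_schedule (j))
            scheduled++;
    }
    return scheduled;
}

int main (void)
{
    job_t j3 = { 3, true, NULL };
    job_t j2 = { 2, true, &j3 };
    job_t j1 = { 1, true, &j2 };
    /* With queue-depth = 2, only jobs 1 and 2 are considered. */
    printf ("scheduled %d job(s)\n", schedule_jobs (&j1, 2));
    return 0;
}
```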

@dongahn
Member Author

dongahn commented Aug 23, 2016

@lipari:

This is reasonable, but as you and I discussed (before I realized I had a meeting with Tapasya), my main question is how the queue depth should interact with the reservation depth, which we currently use for hybrid backfill scheduling.

Should queue-depth take precedence over reservation depth? If so, when queue-depth < reservation depth in hybrid backfill, do we print an error or warning message saying this policy is not fully supported? Or would it be better to use an alternative parameter that tells the scheduler to consider x additional jobs after the reservation depth is satisfied? As you argue, the downside of the latter is that it may not work with FCFS, since its concept of a reservation is implicit. But at least it would not change the policy.

In any case, one thing we will probably want to do is introduce a max for the pure conservative backfill mode (and make it configurable as well). We probably need to understand what that max should be under a stress load like distribution #14.

The other reasonable optimization is not to run the scheduling loop for each and every job/resource event.

@dongahn
Member Author

dongahn commented Aug 23, 2016

OK. I did a quick test for this by simply limiting the number of jobs the schedule loop considers to 100, and this makes a pretty substantial difference (2x):

Jobs Executed Per Second (JEPS) at <2 nodes, 1 broker per node, unit sizing policy, non-mpi sleep, FCFS>: 3.5.

@dongahn
Member Author

dongahn commented Aug 23, 2016

In order to reason about the performance more comprehensively, I am pretty convinced that I will need a performance model for our scheduler. I will spend some time coming up with one.

@lipari
Contributor

lipari commented Aug 23, 2016

@dongahn, with regard to reservation and queue depths, any warning should come from the backfill plugin if the requested reservation depth exceeds the queue depth. Queue depth will take precedence over reservation depth. Given that reservation depth is only relevant to the backfill plugin, I am not in favor of creating a queue depth that is tied to the reservation depth.
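A minimal sketch of that compatibility check, assuming hypothetical parameter names and a plain stderr warning rather than the plugin's real logging:

```c
/* Sketch: queue-depth takes precedence over reservation-depth.
 * The parameter names and the warning mechanism are illustrative. */
#include <stdio.h>

/* Clamp the effective reservation depth to the queue depth and warn
 * when the requested reservation depth cannot be honored.
 * queue_depth == 0 means "no limit". */
static unsigned effective_reserve_depth (unsigned reserve_depth,
                                         unsigned queue_depth)
{
    if (queue_depth && reserve_depth > queue_depth) {
        fprintf (stderr,
                 "reserve-depth %u exceeds queue-depth %u; clamping\n",
                 reserve_depth, queue_depth);
        return queue_depth;
    }
    return reserve_depth;
}

int main (void)
{
    printf ("effective reservation depth: %u\n",
            effective_reserve_depth (64, 32));
    return 0;
}
```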

@dongahn
Member Author

dongahn commented Aug 23, 2016

@lipari: OK. This makes sense to me. I will see what I can do.

In terms of comparing our notes, here is my $.02 on the scheduling control parameters. I believe it is not only the "scheduler plugins" but also the set of "scheduling control parameters" that will represent the choices a user can make to specialize the scheduling policy on his/her Flux instance. E.g., if a user wants HTC UQ scheduling, he may want to instantiate our EASY backfill scheduler with a short queue-depth on his Flux instance. Users can write and instantiate their own custom scheduling plugins, but most users will probably want to use ready-made choices.

So, one thing I would like to ensure is that the scheduling control parameters are not only configurable and manageable, but also have clear semantics. Like you suggested, having compatibility checks is clearly one way to convey this. If this space becomes large in the end, things can quickly get unwieldy, so what we may ultimately need is a catalogue that names some of the well-known choices.

Now, I looked at SLURM's control parameters and it seems they are not as intuitive as one would hope, and this may be an area where we can do much better.

From @surajpkn's recent email

But a quick look at http://slurm.schedmd.com/sched_config.html seems to suggest what you said, that it supports conservative backfilling only.

But http://slurm.schedmd.com/slurm.conf.html lists a parameter called "default_queue_depth", which is the number of jobs that SLURM should consider for scheduling during a scheduling cycle. For example, if there are 1000 jobs in the queue and default_queue_depth = 100, then when SLURM comes to a scheduling event, it will consider running only the first 100 jobs and ignore the rest. Now, I am not able to understand whether it excludes backfilling or whether even the backfilling is done only within the 100 jobs. So someone with SLURM experience could definitely answer whether this parameter can be used for what we want to achieve.

BTW, this is one of the reasons why I wanted to build a performance model of the scheduler to reason about this space... Sorry for the long tangent.

@SteVwonder
Member

I was wondering how this discussion could lead to a performance improvement for FCFS. Glancing again at the code, it seems that the FCFS scheduler loops over every job and schedules any jobs that can be scheduled at the current time. If I'm reading the code correctly, this actually isn't FCFS; this is backfilling without a reservation. (I think I am missing something, though.)

For FCFS, once the first job fails, we should break out of the schedule_jobs loop. Or, to keep with the plugin structure of the code, once a call to reserve_resources has been made under the FCFS scheduler, all subsequent calls to find_resources should return an empty tree. A call to sched_loop_setup could then reset the behavior (so that find_resources returns resources again). This would lead to a very fast FCFS scheduler.
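A heavily simplified sketch of that flag-based approach; the function names carry a _sketch suffix and the signatures are made up for illustration, not the actual plugin interface:

```c
/* Sketch of the FCFS short-circuit idea: once a reservation has been made,
 * later find_resources calls return nothing until the next loop setup.
 * resrc_tree_t and all signatures here are illustrative stand-ins. */
#include <stdbool.h>
#include <stddef.h>

typedef struct resrc_tree resrc_tree_t;   /* opaque stand-in */

static bool resources_reserved = false;   /* plugin-local state */

/* Called by the framework at the start of each scheduling loop. */
int sched_loop_setup_sketch (void)
{
    resources_reserved = false;
    return 0;
}

/* Under FCFS, stop offering resources once a reservation exists. */
resrc_tree_t *find_resources_sketch (resrc_tree_t *root)
{
    if (resources_reserved)
        return NULL;                       /* effectively an empty tree */
    return root;                           /* the normal search would go here */
}

/* Record that the head-of-queue job could not run and has reserved
 * resources, so the rest of the queue is skipped for this loop. */
int reserve_resources_sketch (resrc_tree_t *selected)
{
    (void)selected;
    resources_reserved = true;
    return 0;
}
```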

For the backfilling scheduler, I agree with the existing discussion that the queue_depth should override the reservation_depth. This would make the parameter useful in the conservative backfilling case and should be much easier to generalize to all 3 types of backfilling.

@dongahn
Member Author

dongahn commented Aug 23, 2016

@lipari can weigh in. If I understood him right yesterday, FCFS can do out-of-order scheduling if a high-priority job requests nodes with specific constraints (e.g., bigger memory, etc.).

@lipari
Contributor

lipari commented Aug 23, 2016

If all of the resources being scheduled were identical, then the FCFS scheduler could be simplified to stop searching down the job queue once the first reservation had been made. The current FCFS scheduler supports heterogeneous resources: nodes with more memory, different numbers of sockets, etc. So, using @dongahn's example, all the big-memory nodes could be reserved for the top-priority pending job, while small-memory nodes could be allocated to lower-priority jobs that will accept them.

@dongahn
Member Author

dongahn commented Aug 23, 2016

First thoughts on the performance model. Perhaps we can model the overall scheduling overheads as the combination of the following overhead terms:

T = T(queue) + T(schedule jobs) + T(program execution protocol) + T(sched framework)

Then,

  • T(queue) consists of T(core submit/enqueue service) and T(sched queue op)
  • T(schedule jobs) consists of T(resrc), which can be further broken down depending on what we find
  • T(program execution protocol) is the overhead to speak with the wreck service through its protocol
  • T(sched framework) is the pure overhead of the scheduler framework service's control code

If the KVS with many front-loaded submissions causes us problems, then T(core submit/enqueue service) and T(program execution protocol) may grow large as we increase the number of jobs...

If the sum of the jobs' computation is not much greater (>>) than T, then we are scheduling-bound for that workload... This kind of reasoning may help @SteVwonder on his "scheduler scalability" metric as well.
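Written out, under the assumption that T_compute(i) denotes job i's own computation time (a notation introduced here for illustration), the model and the scheduling-bound condition are roughly:

```latex
T = T_{\mathrm{queue}} + T_{\mathrm{schedule\ jobs}}
  + T_{\mathrm{program\ execution\ protocol}} + T_{\mathrm{sched\ framework}}

\sum_{i=1}^{N} T_{\mathrm{compute}}(i) \not\gg T
  \;\Longrightarrow\; \text{the workload is scheduling-bound}
```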

I will think some more about this after the concall before doing some work on it.

@SteVwonder
Member

SteVwonder commented Aug 23, 2016

I see now. We have different definitions of FCFS. I have been operating under the assumption that FCFS means that jobs will be started in the exact order that they are submitted. So if you produce two lists of the jobs, one sorted based on the jobs' start times and the other sorted based on the jobs' submit times, the two lists should exactly match.
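A tiny sketch of that check, assuming per-job submit and start times are available (the jobrec_t record below is an illustrative stand-in, not a real flux-sched structure):

```c
/* Sketch: verify the strict FCFS property by checking that ordering jobs
 * by start time gives the same sequence as ordering them by submit time.
 * jobrec_t is an illustrative stand-in. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int id;
    double submit_time;
    double start_time;
} jobrec_t;

static int by_submit (const void *a, const void *b)
{
    double d = ((const jobrec_t *)a)->submit_time
             - ((const jobrec_t *)b)->submit_time;
    return (d > 0) - (d < 0);
}

static int by_start (const void *a, const void *b)
{
    double d = ((const jobrec_t *)a)->start_time
             - ((const jobrec_t *)b)->start_time;
    return (d > 0) - (d < 0);
}

/* True if the two orderings (by submit time, by start time) match exactly. */
static bool is_strict_fcfs (const jobrec_t *jobs, size_t n)
{
    jobrec_t *s1 = malloc (n * sizeof (*s1));
    jobrec_t *s2 = malloc (n * sizeof (*s2));
    bool ok = (s1 && s2);
    for (size_t i = 0; ok && i < n; i++)
        s1[i] = s2[i] = jobs[i];
    if (ok) {
        qsort (s1, n, sizeof (*s1), by_submit);
        qsort (s2, n, sizeof (*s2), by_start);
        for (size_t i = 0; i < n && ok; i++)
            ok = (s1[i].id == s2[i].id);
    }
    free (s1);
    free (s2);
    return ok;
}

int main (void)
{
    /* Job 3 was submitted after job 2 but started earlier: not strict FCFS. */
    jobrec_t jobs[] = { { 1, 0.0, 10.0 }, { 2, 1.0, 20.0 }, { 3, 2.0, 15.0 } };
    printf ("strict FCFS: %s\n", is_strict_fcfs (jobs, 3) ? "yes" : "no");
    return 0;
}
```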

I think we are getting away from the original purpose of this issue though. I think we should move the discussion over to #168, since for the validation, we will need to rigorously define the behaviors that we expect from the different scheduling plugins with various parameters.

@lipari
Contributor

lipari commented Sep 13, 2016

Closed via #190.
