
splash app tracking issue #1358

Closed
trws opened this issue Mar 15, 2018 · 46 comments

Comments

@trws
Member

trws commented Mar 15, 2018

This is to help tie together information for the splash app for sierra.

General problem:
Run between 1 and 2 million jobs over the course of a day across four thousand nodes; each job is one node, four processes, in size with proper affinity, and each process should be near a GPU and have a GPU selected. It should be possible to cancel a job in flight if its priority falls below a certain level; this logic doesn't have to be in flux, but the cancel mechanism needs to be available. To deal with the large number of jobs, we need to be able to handle long queues and fast submission simultaneously, both at startup and over the course of a full run; purging old jobs is an acceptable solution even if it loses job information for completed jobs.

Currently the filed issues are:

high priority:

Found in this effort, non-blocking:

Tagging for notification: @grondo, @garlick, @SteVwonder, @dongahn

@garlick
Member

garlick commented Mar 15, 2018

(sent this in email to @trws but should have posted here for one stop shopping):
Re: redirecting sqlite db to a filesystem with plenty of free space:

By default, flux creates its sqlite db in "broker.rundir", which is normally in /tmp.
An instance that has a lot of KVS activity might want to redirect it elsewhere:

flux start -o,-Spersist-directory=/new/directory

/new/directory must already exist. A "content" subdirectory will be created
within it to contain the database.
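
A quick, hedged example of the above (the paths are placeholders; the target directory must exist before the broker starts):

    # redirect the KVS sqlite database to a filesystem with more free space
    mkdir -p /new/directory
    flux start -o,-Spersist-directory=/new/directory
    # the database is then created under /new/directory/content/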

@SteVwonder
Member

one node, four processes, in size with proper affinity

Flux-core (via wreck) supports setting affinities through the cpumask key in the KVS [src code], but flux-sched does not currently populate that information when scheduling jobs. I have code in a branch of flux-sched that does create this, but I have only tested it against ~v0.5.0 of flux-core, and it's not the most polished C code... sorry 😞.
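
For reference, a rough illustration of poking at that key with the KVS commands (the lwj.1 job directory and the "0-3" mask are hypothetical; the actual per-job KVS layout depends on the flux-core version):

    # read the affinity cpumask wreck consults for job 1, rank 0
    flux kvs get lwj.1.rank.0.cpumask
    # or set it by hand before tasks start, e.g. to cores 0-3
    flux kvs put lwj.1.rank.0.cpumask="0-3"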

each process should be near a GPU and have a GPU selected

Does Sierra support binding processes to specific GPUs? I imagine supporting this will require work both in wreck and flux-sched. How important is locality here? Is it just picking the "closest" socket (e.g., GPUs 1-3 are closer to socket1 and GPUs 4-6 are closer to socket2), or is it more complex?

It should be possible to cancel a job in flight if its priority falls below a certain level

Do they need the jobs to be considered/scheduled in order of their priority? Or are all jobs currently in the queue fair game for scheduling?

@trws
Member Author

trws commented Mar 15, 2018

The affinity stuff will almost certainly be handled by a script for this, just for time reasons. GPU affinity can be set by an environment variable, and closest socket is sufficient.

All jobs in the queue are fair game for scheduling, priority is being handled externally.
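
A sketch of the kind of wrapper script meant here, assuming CUDA_VISIBLE_DEVICES as the environment variable and a hypothetical LOCAL_RANK variable for the per-node task index (neither name is guaranteed by wreck):

    #!/bin/sh
    # pick-gpu.sh: map each of the 4 tasks on a node to a nearby GPU, then exec the app
    LOCAL_RANK=${LOCAL_RANK:-0}
    export CUDA_VISIBLE_DEVICES=$((LOCAL_RANK % 4))   # sierra nodes have 4 GPUs
    exec "$@"

It would be launched as something like flux submit -N 1 -n 4 ./pick-gpu.sh ./app, with CPU binding handled in the same script if needed.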

@grondo
Contributor

grondo commented Mar 15, 2018

Copying here from #1356 -- instructions for running flux wreck purge periodically within an instance using flux cron:

  • Purge all but 1000 jobs after every 1000th job completion, running at most once every 10 seconds:
flux cron event --min-interval=10 --nth=1000 wreck.state.complete flux wreck purge -R -t 1000
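
To sanity-check the setup, the purge can also be run once by hand, and the registered cron entry inspected (assuming the flux cron list subcommand is available in this flux-core version):

    flux wreck purge -R -t 1000    # one-shot purge, keeping the most recent 1000 jobs
    flux cron list                 # confirm the periodic purge entry is registered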

@dongahn
Member

dongahn commented Mar 16, 2018

FYI -- I already sent @trws information on how one can optimize the scheduler for HTC workloads. But in case others need to support these types of workloads, queue-depth and delay-sched should be useful.

flux-framework/flux-sched#190
flux-framework/flux-sched#182
flux-framework/flux-sched#183

flux-framework/flux-sched#191
flux-framework/flux-sched#185

Examples:
https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L45
https://github.com/flux-framework/flux-sched/blob/master/t/t1005-sched-params.t#L67
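
For reference, a hedged sketch of how those parameters are passed when loading the sched module (exact option names may differ by flux-sched version; the tests linked above are the authoritative examples):

    flux module remove sched
    flux module load sched sched-params=queue-depth=1024,delay-sched=true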

@SteVwonder
Member

Also mentioned in the email chain:
In this scenario (i.e., all jobs are equal size), FCFS is a good candidate over backfilling. FCFS avoids the cost of making reservations in the future and avoids attempting to schedule jobs in the queue that provably cannot be scheduled. In addition to loading the sched.fcfs plugin, I believe you also need to set queue-depth to 1.
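
A hedged sketch of that configuration (the plugin= and sched-params= option names are my assumption about the flux-sched module options; check the flux-sched docs and tests for the exact spelling):

    flux module load sched plugin=sched.fcfs sched-params=queue-depth=1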

@dongahn
Member

dongahn commented Mar 16, 2018

I believe you also need to set queue-depth to 1.

Indeed! I was surprised when @lipari showed cases where FCFS actually does an out-of-order schedule. But in the case of splash, their job sizes are all the same so FCFS with queue-depth being 1 should be the cheapest.

@trws
Member Author

trws commented Mar 16, 2018 via email

@dongahn
Member

dongahn commented Mar 16, 2018

Particularly assisting priority scheduling by doing a sort on the entire linked-list of jobs every time schedule_jobs is entered

This was the latest addition done by @lipari and @morrone. I'm wondering if there is an option to turn it off.

then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.

FCFS shouldn't do this, though?

@trws
Member Author

trws commented Mar 16, 2018 via email

@dongahn
Member

dongahn commented Mar 16, 2018

Ah. Maybe the sched_loop_setup routine can be modified to do this. Currently, sched passes job info to sched_loop_setup, but it could be modified so that the plug-in code can also pass back information on the plug-in's characteristics (like whether it is reservation-capable).

Of course, the setup code should be called before the resrc_tree_release_all_reservations operation. (Also, I haven't looked at what assumptions are being made in resrc.)

@grondo
Contributor

grondo commented Mar 21, 2018

@trws, I'm trying to understand the "proper affinity" requirement. Currently the launcher (wrexecd) doesn't have any concept of whether a node is allocated in whole or in part to a launching job. Unfortunately, the prototype code only works off a count of "cores" assigned to each rank in [lwj-dir].rank.N.cores, and doubly unfortunately the name "cores" for this key is a misnomer as it is really assumed to be the number of tasks assigned to the rank.

There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at [lwj-dir].rank.N.cpumask), which could be used to denote the actual cores, by logical id, that have been assigned on that rank; in combination with the number of tasks, this could allow automatic affinity of individual tasks using a defined strategy.

I guess one of the main problems here is that in the wreck prototype we didn't bother separating assigned resources from assigned task counts and now we might have to think of a way around that, since the replacement execution system isn't baked yet.

Let me know what approach you'd like to take, or if you have other ideas, and I'll open a specific issue.

@dongahn
Member

dongahn commented Mar 21, 2018

There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job (at [lwj-dir].rank.N.cpumask), which could be used to denote the actual cores, by logical id, that have been assigned on that rank; in combination with the number of tasks, this could allow automatic affinity of individual tasks using a defined strategy.

@SteVwonder may want to chime in. I believe he modified this mechanism to get the affinity control he needs for the exploration of hierarchical scheduling.

@grondo
Contributor

grondo commented Mar 21, 2018

I'm willing to do whatever you guys need in flux submit/wrexecd if it makes things easier. Even if sched just writes a list of allocated cores in addition to counts, I can make that work for this use case, I think.

@dongahn
Member

dongahn commented Mar 21, 2018

@trws will have to weigh in. But if I understood him right, he's trying to coarsen the hardware tree representation for higher scheduling scalability, such that each socket vertex contains only one core pool (e.g., core[22] as opposed to 22 core vertices) and cores are scheduled in terms of their count. If this is successful, my guess is affinity may have to be resolved at a different level.

@grondo
Contributor

grondo commented Mar 21, 2018

Thanks @dongahn! I think I see.

If the scheduler is assigning non-differentiated cores from a pool, then I don't see a way any other part of the system can calculate what the proper affinity will be. The scheduler is the only thing that knows the global view of the system and which cores in a given socket are currently in use.

The best we can do for now is to bind tasks at the finest grain that the scheduler is tracking (sockets in this case), and let the OS distribute processes optimally within that cpumask.

@grondo
Contributor

grondo commented Mar 21, 2018

A more tractable issue we can tackle in this timeframe is affinity for tasks on nodes where the job has been assigned the whole node. There are at least two problems that make this currently not work:

  1. the assumption that the rank.N.cores value is also the number of tasks assigned to the current rank. Possibly, there is no current way to assign more than one "core" to a task (oops...)
  2. wrexecd doesn't currently know if it has the node exclusively. This can easily be fixed with a workaround for the current system (flag on the rank.N directory, etc.)

In fact maybe 2 alone would work in the short term? wrexecd could still assume that rank.cores is the number of tasks to run, but if rank.exclusive flag was set, then it would distribute those tasks across cores on the current node using something equivalent to hwloc-distrib? (I really hate to continue the rank.cores hack, but also don't want to spend much time "fixing" the wreck code...)
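
For the exclusive-node case, the CLI equivalents show the idea (illustrative only; wrexecd would use the corresponding hwloc C calls, and the mask below is just an example of hwloc-distrib output):

    # compute 4 evenly spread cpusets across the node topology
    hwloc-distrib 4
    # bind one task into one of the returned cpusets
    hwloc-bind 0x000000ff -- ./app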

@trws
Member Author

trws commented Mar 22, 2018

For the purpose of the splash app itself, this is not a major worry, since I intend to deal with binding of the GPUs in a script anyway I can do the CPU binding there as well. For ATS, we need to work this out. The two main cases I would like to find a way to support dovetail pretty well with our discussion of the submit command nnodes parameter:

  1. nodes are allocated in full with a given number of tasks; result: each task gets as much independent cache as possible
  2. a certain number of cores is allocated, but the node is not exclusive, so some cores are left available. For now, having this be the same as the number of tasks is fine; we'll want to loosen that as soon as reasonably possible. The result could be either of the two below:
  3. same as above; the only problem is that there will then be overlapping core-sets as other jobs come into the node
  4. bind each task to an individual core, as close as possible to one another without sharing cores (no use of PUs on the same core)

this hwloc function generates the cpuset for a given distribution based on a topology and a number of threads. It makes implementing the spread-out version a lot easier than doing it manually (bad, bad memories...). Doing the other one is a round-robin on available cores since you want the nearest ones anyway.

Does this make some sense?

@trws
Member Author

trws commented Mar 22, 2018

By the way, I am coarsening the hardware topology somewhat, but mainly to add levels we can't use productively, like multiple per-core caches. The main levels are left alone to avoid having to alter sched to handle something different.

@grondo
Contributor

grondo commented Mar 22, 2018

Does this make some sense?

Thanks @trws, your notes above really help clarify the requirements. The issue is indeed simpler than I was initially thinking.

Unfortunately, the wreckrun prototype wasn't designed to handle these situations, so solving this for the short term will require some hackery.

The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use hwloc_bind() as suggested if it allows "unavailable" cores to be masked out)

If this is not currently possible, then we could handle case 1 easily by either a flag set by the scheduler that says the node is assigned exclusively, or by assuming the node is assigned exclusively when rank.cores = total cores. I don't think we want to use 3. above, since as you note it will hurt performance by default on shared nodes. Case 4 is only possible if individual resources are assigned by the scheduler, otherwise we would always assign tasks to the same cores since there is no information about other jobs assigned to the node.

So, here's my proposal:

  • Open an issue against sched to write a list/cpuset of cores assigned to the kvs rank.N directory (will also accept a global list of (rank, cpuset) if that is easier); we just need some way for wrexecd to determine which cores are assigned on its rank. As @garlick pointed out, this would be the first version of Rglobal or Rlocal.
  • Open an issue against wrexecd to stop using rank.N.cores as the count of tasks to run on the node, and use the global ntasks plus local cores information to determine which and how many tasks to launch on each rank.
  • Open an issue against wrexecd to set the default job cpumask to the assigned cpuset (we could also use an actual cpuset, but that is probably overkill for now), and optionally use hwloc_distrib() within that cpuset to distribute tasks. (I'm assuming that if the number of cores in the cpuset is equal to the number of things you are distributing, then hwloc_distrib will assign a single core to each task; see the sketch below.)

Does this sound reasonable at all?
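
A minimal sketch of the third bullet using the hwloc CLI tools (the cpuset value is a placeholder standing in for whatever the scheduler would write into rank.N):

    # cpuset assigned to this rank by the scheduler (placeholder value)
    CPUSET=0x00000ff0
    # distribute 4 tasks across only the assigned cores
    hwloc-distrib --restrict $CPUSET 4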

@dongahn
Member

dongahn commented Mar 22, 2018

The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use hwloc_bind() as suggested if it allows "unavailable" cores to be masked out)

If the core vertices are not pooled together in the resource representation, this should be straightforward, I think. From @trws's comment above, it looks like he doesn't coarsen the core representation.

We will have to see how this scales though: the number of core vertices for Sierra = ~4000 x 44 = 176K.

Ultimately, we need aggressive pruning for resource searching. I have it on resource-query. Maybe we can add pruning by exclusivity on resrc if it's easy.

BTW, LSF supports core isolation which has proven to be needed to get good performance on Sierra. If Splash needs this, we need sched not to schedule tasks on those specialized cores for OS daemons.

It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.

@grondo
Contributor

grondo commented Mar 22, 2018

It may be the case that if lsf does core isolation through cgroup, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good.

Yeah, since the brokers are launched under LSF we should inherit the isolation provided by it.

@trws
Member Author

trws commented Mar 22, 2018

We would, but my understanding is that none of that has been done yet.

@SteVwonder
Member

W.r.t. the resrc_tree_release_all_reservations discussion in today's meeting, you should be able to comment out this function after making the resrc_tree_reserve in the fcfs plugin a noop.

@dongahn
Member

dongahn commented Mar 22, 2018

Two action items from today's discussion on this topic:

  1. Even with FCFS policy and jobs that need only one node, resrc_tree_release_all_reservations is currently called from within the schedule loop, which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without breaking any assumptions. This is a blocker for the splash use case. TODO: @dongahn to look at those assumptions (@SteVwonder's comment above answers this question, though). Ultimately, this needs to go into the scheduler policy plugin.

  2. Head-of-line blocking at the message queue seems to cause sched to stall when the job submission rate is very high. Three potential issues were discussed: 1) too many round trips between the kvs and sched to get job information through JSC -- one optimization would be to piggyback the submitted job information onto the wreck.state event (@grondo and @dongahn to look into this?); 2) too many synchronous kvs operations within JSC -- one optimization would be to use the async kvs API within JSC (@garlick to look into this); 3) converting json strings to json objects and json objects back to json strings within JSC. This isn't a blocker for the splash use case, but we agreed it would be good to look at soonish.

@grondo
Contributor

grondo commented Mar 22, 2018

One optimization would be to piggyback the submitted job information to wreck.state event

Looking back through the job module, I see that currently jobs first enter the "reserved" state before transitioning to "submitted". In addition to including nnodes,ntasks information in the wreck.state.submitted event, we could also skip the "reserved" state for submitted jobs (I don't see how it is necessary) and save 1 event per submitted job, if that is not an issue for sched.
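
Either way, the resulting event stream is easy to watch while testing, since flux event subscriptions match by topic prefix (the payload fields would be whatever is agreed on here):

    # print wreck state-transition events as jobs are submitted
    flux event sub wreck.state.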

@trws
Member Author

trws commented Mar 22, 2018 via email

@dongahn
Member

dongahn commented Mar 22, 2018

Even if sched didn't do this, reducing state transitions within sched should be pretty easy. What we need is just a contract between wreck and sched.

@grondo
Contributor

grondo commented Mar 22, 2018

For wreck the "reserved" state just means that the KVS directory is reserved for wreck as writer. In the case of flux-submit the state immediately transitions to "submitted" without any further data so let's just say the contract is that "reserved" state should be ignored by other consumers, and "submitted" state is the initial state for a job submitted to the scheduler.

I have no idea what the null->null transition signifies.

@garlick
Member

garlick commented Mar 22, 2018

libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?

@dongahn
Member

dongahn commented Mar 22, 2018

@garlick: I vaguely remember I needed something like that for the various synchronization issues that led to RFC 9. I have to see if that fixup is needed now that we only use events to pass the job state. Let me look.

@dongahn
Member

dongahn commented Mar 22, 2018

@SteVwonder and @trws: I will try to do a quick PR for the resrc_tree_release_all_reservations issue. I propose:

  • Introduce get_sched_properties () into the sched plugin API, along with a struct sched_prop to encode the properties of scheduler plugins. "backfill capable" can be the only property supported at this point.
  • The sched framework queries this property during initialization, and the scheduler loop calls resrc_tree_release_all_reservations only if the loaded plugin is backfill capable.
  • As @SteVwonder noted, I will make the resrc_tree_reserve call a NOOP in the fcfs plugin.

@grondo
Contributor

grondo commented Mar 22, 2018

ok, my wreck-experimental branch has the changes discussed here

  • eliminate 'reserved' state for submitted jobs
  • include ntasks,nnodes,walltime from job request in the wreck.state.submitted event

@dongahn
Member

dongahn commented Mar 22, 2018

ok, i will test this soon. Just to make sure I understand, wreck.state.reserved events are not emitted any longer with this version?

@grondo
Contributor

grondo commented Mar 22, 2018

ok, i will test this soon. Just to make sure I understand, wreck.state.reserved events are not emitted any longer with this version?

They are emitted only with flux wreckrun, that is when we bypass the scheduler.

@dongahn
Member

dongahn commented Mar 23, 2018

  1. Even with FCFS policy and jobs that need only one node, resrc_tree_release_all_reservations is currently called from within the schedule loop, which then traverses the entire resource tree -- a big performance issue. But reservations are only needed for backfill schedulers, and we want to find a way to turn this off for FCFS without breaking any assumptions. This is a blocker for the splash use case. TODO: @dongahn to look at those assumptions (@SteVwonder's comment above answers this question, though). Ultimately, this needs to go into the scheduler policy plugin.

It turned out there is more to it than making resrc_tree_reserve in the fcfs plugin a NOOP. Simply doing the work as described in #1358 (comment) schedules jobs out of order!

I will look more closely into this later.

@dongahn
Member

dongahn commented Mar 23, 2018

I suspect that the FCFS scheduler actually requires the reservation capability for the general case.

If the first job can't find all of its required resources, the scheduler should reserve the partially found resources so that it can move on to the next job and see whether that one can be scheduled.

The reason for this out-of-order behavior is that the next job may require a different type of resources than the first job, and the fact that the first job is not scheduled shouldn't prevent the next one from being scheduled -- that is, as long as it doesn't take the resources that the first job will use at the next schedule loop.

I think we can still remove the release-reservation call if we assume the FCFS scheduler uses queue-depth=1. I will see if I can make this a special case.

Scheduler optimization like this falls within the grand scheme of scheduler specialization, so special-casing like this shouldn't be that bad -- as long as we can manage the complexity with config files, etc. later on.

@SteVwonder
Member

I think we can still remove the release-reservation call if we assume the FCFS scheduler uses queue-depth=1. I will see if I can make this a special case.

@dongahn, that's a great point. The queue-depth=1 requirement can be worked around by having the first call to reserve_resources set a boolean flag that causes subsequent calls to allocate_resources (within the same sched loop) to return immediately. A call to sched_loop_setup could reset the boolean flag so that allocate_resources is no longer a NOOP.

The reason for this out-of-order behavior is that the next job may require a different type of resources than the first job, and the fact that the first job is not scheduled shouldn't prevent the next one from being scheduled -- that is, as long as it doesn't take the resources that the first job will use at the next schedule loop.

To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a job that isn't first in the queue be scheduled wouldn't be FCFS; that would be backfilling with a reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.

@dongahn
Member

dongahn commented Mar 23, 2018

To be pedantic, FCFS should immediately stop checking jobs once the first job in the queue fails to be scheduled. FCFS always serves the job that has been waiting the longest (i.e., the job that came first). To have a job that isn't first in the queue be scheduled wouldn't be FCFS; that would be backfilling with a reservation depth >= 0. That doesn't need to be addressed here, but we should consider making this the behavior of our FCFS plugin to avoid surprising users in the future.

Good point. However, I think this was discussed with @lipari, and I think we agreed that this should be the behavior (I vaguely remember he convinced us that other schedulers implement FCFS this way). I already have a PR for this -- could you review? We can revisit these semantics later if needed, though.

@trws
Member Author

trws commented Mar 23, 2018

Honestly, I'm fine with out of order for now. You're certainly right that it doesn't fit the model, but for the current push it kinda doesn't matter.

@trws
Member Author

trws commented Mar 23, 2018

Also, true fcfs is, as stephen mentions, always depth one. It's an odd side-effect of the decomposition of sched that fcfs implements a partial backfill at the moment.

@dongahn
Member

dongahn commented Mar 23, 2018

True. We can call the depth-one version pedantic FCFS and depth > 1 optimized FCFS.

At the end of the day, queuing policy + scheduler parameters will determine the performance of the scheduler and will serve as our knobs to specialize our scheduling tailored to the workload.

At some point we should name policy plugin + parameters for some of the representative workloads like "HTC small jobs" although we should still expose the individual knobs to users as well.

@trws
Member Author

trws commented Mar 23, 2018 via email

@dongahn
Member

dongahn commented Mar 23, 2018

libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched?

I looked at the code. It seems I did it this way to work around some odd synchronization problems between kvs update and eventing. Maybe I can find the detail from an issue ticket... Since I need to tune the logic w/ respect to @grondo's augmented submit event change, I will see if I can remove this transition.

@dongahn
Member

dongahn commented Mar 23, 2018

@garlick: okay I found #205 (comment)

Essentially, this was to break a race condition. My guess is now we are using event to emit the state, we would be able to live without this transition...

@garlick
Member

garlick commented Apr 9, 2018

I'll go ahead and close this, now that we have a project board for splash.

@trws, @grondo, @dongahn, and @SteVwonder may want to review the discussion in this issue quickly, to determine if anything was discussed that didn't get peeled off into its own issue.

@garlick garlick closed this as completed Apr 9, 2018