splash app tracking issue #1358
(sent this in email to @trws but should have posted here for one stop shopping):
|
Flux-core (via wreck) supports setting affinities through a per-rank cpumask in the job's KVS directory.
Does Sierra support binding processes to specific GPUs? I imagine supporting this will require work both in wreck and flux-sched. How important is locality here? Is it just picking the "closest" socket (e.g., GPUs 1-3 are closer to socket1 and GPUs 4-6 are closer to socket2), or is it more complex?
Do they need the jobs to be considered/scheduled in order of their priority? Or are all jobs currently in the queue fair game for scheduling? |
The affinity stuff will almost certainly be handled by a script for this, just for time reasons. GPU affinity can be set by an environment variable, and closest socket is sufficient. All jobs in the queue are fair game for scheduling; priority is being handled externally. |
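For concreteness, a minimal sketch of what such a wrapper script could look like (not from the thread; the `FLUX_TASK_LOCAL_ID` variable name and the GPU-to-socket wiring are assumptions):

```python
#!/usr/bin/env python3
"""Sketch: pick the GPU nearest the task's socket and exec the real app."""
import os
import sys

GPUS_PER_SOCKET = {0: [0, 1, 2], 1: [3, 4, 5]}  # assumed node wiring
TASKS_PER_SOCKET = 2                            # 4 tasks / 2 sockets (splash case)

local_rank = int(os.environ.get("FLUX_TASK_LOCAL_ID", "0"))  # assumed launcher variable
socket_id = local_rank // TASKS_PER_SOCKET
gpu = GPUS_PER_SOCKET[socket_id][local_rank % TASKS_PER_SOCKET]

# Restrict the task to its nearest GPU, then run the real application.
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
os.execvp(sys.argv[1], sys.argv[1:])
```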
Copying here from #1356 -- instructions for running
|
FYI -- I already sent @trws info on how one can optimize the scheduler for HTC workloads. But in case others need to support these types of workloads, queue-depth and delay-sched should be useful. flux-framework/flux-sched#190 flux-framework/flux-sched#191 Examples: |
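As a rough illustration of what the queue-depth knob buys (a conceptual sketch, not flux-sched code): bounding how many pending jobs each pass of the scheduling loop examines keeps the per-pass cost independent of total queue length, which is what matters for HTC-style queues with many small jobs.

```python
def schedule_pass(pending_jobs, resources, queue_depth):
    """Conceptual sketch: examine at most queue_depth jobs per pass.

    resources.try_allocate is an assumed helper standing in for the
    scheduler's resource search; it returns an allocation or None.
    """
    started = []
    for job in pending_jobs[:queue_depth]:   # bounded walk of the queue
        alloc = resources.try_allocate(job)
        if alloc is not None:
            started.append((job, alloc))
    return started
```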
Also mentioned in the email chain: |
I believe you also need to set queue-depth to 1. Indeed! I was surprised when @lipari showed cases where FCFS actually does an out-of-order schedule. But in the case of splash, their job sizes are all the same, so FCFS with queue-depth being 1 should be the cheapest. |
I’ll definitely try that. A lower queue depth helped a little bit, but it looks like we had some non-scalable things in the sched loop too. Particularly, assisting priority scheduling by doing a sort on the entire linked list of jobs every time schedule_jobs is entered, and then traversing the entire hardware tree to clear reservations every time it’s entered as well. There may be others, but those make the performance relatively unfortunate for anything over about 1000 nodes or 1500 jobs in the queue.
|
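One way around the per-pass sort (a sketch only; flux-sched's actual data structures and API differ, and `priority`, `submit_time`, and `jobid` are assumed job attributes) is to keep the pending queue ordered at submission time, so schedule_jobs never has to re-sort the whole list:

```python
import bisect

class PendingQueue:
    """Keep jobs ordered by (priority desc, submit time) at insert time."""

    def __init__(self):
        self._jobs = []   # list of (sort_key, job), always kept sorted

    def submit(self, job):
        # jobid breaks ties so two equal-priority jobs never compare directly
        key = (-job.priority, job.submit_time, job.jobid)
        bisect.insort(self._jobs, (key, job))   # O(log n) search + O(n) shift

    def head(self, queue_depth):
        """Jobs the next schedule pass should consider, already in order."""
        return [job for _, job in self._jobs[:queue_depth]]
```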
> Particularly, assisting priority scheduling by doing a sort on the entire linked list of jobs every time schedule_jobs is entered

This was the latest addition done by @lipari and @morrone. I'm wondering if there is an option to turn it off.

> then traversing the entire hardware tree to clear reservations every time it's entered as well

FCFS shouldn't do this, though? |
The problem is that both are done outside of the plugin, so they happen regardless of the actual algorithm being used. I’m looking through it to see how that can be fixed without breaking some assumptions elsewhere. It may be that we need another function for pre-schedule-loop setup in scheduler plugins to factor this out.
|
Ah. Maybe a pre-schedule-loop setup function, then. Of course, the setup code should be called first, before the schedule loop is entered. |
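A sketch of what such a factoring could look like (illustrative Python only; the real plugin interface is in C and these names are made up): the framework calls an optional per-pass setup hook, so only policies that actually need the global sort or reservation clearing pay for them.

```python
class SchedPlugin:
    """Illustrative plugin base; not flux-sched's actual plugin ABI."""

    def sched_loop_setup(self, jobs, resources):
        """Optional hook: called once before each pass of the schedule loop."""
        pass

    def schedule_job(self, job, resources):
        raise NotImplementedError


class PriorityPlugin(SchedPlugin):
    def sched_loop_setup(self, jobs, resources):
        jobs.sort(key=lambda j: -j.priority)   # only this policy pays for the sort
        resources.clear_reservations()

    def schedule_job(self, job, resources):
        return resources.try_allocate(job)


class FCFSPlugin(SchedPlugin):
    def schedule_job(self, job, resources):
        return resources.try_allocate(job)     # no per-pass setup needed
```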
@trws, I'm trying to understand the "proper affinity" requirement. Currently the launcher (wrexecd) is only told how many tasks to start on each node, not which specific resources were assigned to the job.

There is currently a kludge in place that allows an affinity cpumask to be set in the per-rank directory of a job in the KVS.

I guess one of the main problems here is that in the wreck prototype we didn't bother separating assigned resources from assigned task counts, and now we might have to think of a way around that, since the replacement execution system isn't baked yet. Let me know what approach you'd like to take, or if you have other ideas, and I'll open a specific issue. |
@SteVwonder may want to chime in. I believe he modified this mechanism to get the affinity control he needs for the exploration of hierarchical scheduling. |
I'm willing to do whatever you guys need in wreck. |
@trws will have to weigh in. But if I understood him right, he's trying to coarsen the hardware tree representation for high scheduling scalability, such that each socket vertex contains only one core pool (e.g., core[22] as opposed to 22 core vertices) and cores are scheduled in terms of their count. If this is successful, my guess is affinity may have to be resolved at a different level. |
Thanks @dongahn! I think I see. If the scheduler is assigning non-differentiated cores from a pool, then I don't see a way any other part of the system can calculate what the proper affinity will be. The scheduler is the only thing that knows the global view of the system and which cores in a given socket are currently in use. The best we can do for now is to bind tasks at the finest grain that the scheduler is tracking (sockets in this case), and let the OS distribute processes optimally within that cpumask. |
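A minimal sketch of socket-granularity binding on Linux (the socket-to-CPU map below is a placeholder; in practice it would come from hwloc or from whatever the scheduler recorded for the job):

```python
import os

# Assumed layout for illustration: a 2-socket, 44-core node.
SOCKET_CPUS = {0: range(0, 22), 1: range(22, 44)}

def bind_to_socket(socket_id):
    """Pin the calling process to every CPU in its socket; the kernel then
    places it anywhere within that mask (pid 0 means the calling process)."""
    os.sched_setaffinity(0, set(SOCKET_CPUS[socket_id]))
```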
A more tractable issue we can tackle in this timeframe is affinity for tasks on nodes where the job has been assigned the whole node. There are at least two problems that make this currently not work:
In fact maybe 2 alone would work in the short term? wrexecd could still assume that ... |
For the purpose of the splash app itself, this is not a major worry; since I intend to deal with binding of the GPUs in a script anyway, I can do the CPU binding there as well. For ATS, we need to work this out. The two main cases I would like to find a way to support dovetail pretty well with our discussion of the submit command nnodes parameter:
This hwloc function generates the cpusets for a given distribution based on a topology and a number of threads. It makes implementing the spread-out version a lot easier than doing it manually (bad, bad memories...). Doing the other one is a round-robin on available cores, since you want the nearest ones anyway. Does this make some sense? |
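To make the two placements concrete, here is a rough Python equivalent of what is being asked for (the real implementation would use the hwloc call mentioned above, which walks the actual topology; this only illustrates the difference on cores grouped by socket):

```python
def spread(cores_by_socket, ntasks):
    """One core per task, spread as evenly as possible across sockets."""
    placement, i = [], 0
    while len(placement) < ntasks and i < max(map(len, cores_by_socket)):
        for cores in cores_by_socket:
            if i < len(cores) and len(placement) < ntasks:
                placement.append(cores[i])
        i += 1
    return placement

def packed(cores_by_socket, ntasks):
    """Round-robin on the nearest available cores: fill the first socket
    before spilling onto the next."""
    flat = [core for cores in cores_by_socket for core in cores]
    return flat[:ntasks]

# Example with 2 sockets x 4 cores and 4 tasks:
# spread([[0,1,2,3],[4,5,6,7]], 4)  -> [0, 4, 1, 5]
# packed([[0,1,2,3],[4,5,6,7]], 4)  -> [0, 1, 2, 3]
```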
By the way, I am coarsening the hardware topology somewhat, but mainly to drop levels we can't use productively, like multiple per-core caches. The main levels are left alone to avoid having to alter sched to handle something different. |
Thanks @trws, your notes above really help clarify the requirements. The issue is indeed simpler than I was initially thinking. Unfortunately, the wreckrun prototype wasn't designed to handle these situations, so solving this for the short term will require some hackery.

The main issue for now is that the scheduler doesn't currently assign individual resources to a job, just a count of cores on each node. If we can tweak that to write individual resources (a list of cores per node), then this work would be fairly trivial. (We can perhaps use ....)

If this is not currently possible, then we could handle case 1 easily by either a flag set by the scheduler that says the node is assigned exclusively, or by assuming the node is assigned exclusively when ...

So, here's my proposal:
Does this sound reasonable at all? |
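Purely as an illustration of the "write individual resources" idea (the key names and shape below are hypothetical, not an existing wreck or sched format), the scheduler could record something like this per job instead of only per-node counts:

```python
# Hypothetical per-job resource record: specific cores (and GPUs) per broker
# rank, rather than just "ncores per node".  With something like this, the
# launcher could compute a correct cpumask without any global knowledge.
assigned_resources = {
    "rank.0": {"cores": [0, 1, 22, 23], "gpus": [0, 3]},
    "rank.1": {"cores": [0, 1, 22, 23], "gpus": [0, 3]},
}
```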
If the core vertices are not pooled together in the resource representation, this should be straightforward, I think. From @trws's comment above, it looks like he doesn't coarsen the core representation. We will have to see how this scales, though: the number of core vertices for Sierra is ~4000 x 44 = 176K. Ultimately, we need aggressive pruning for resource searching. I have it in resource-query. Maybe we can add pruning by exclusivity on resrc if it's easy.

BTW, LSF supports core isolation, which has proven to be needed to get good performance on Sierra. If Splash needs this, we need sched not to schedule tasks on those cores specialized for OS daemons. It may be the case that if LSF does core isolation through cgroups, hwloc doesn't expose those specialized cores to our hwloc module, in which case we should be good. |
Yeah, since the brokers are launched under LSF we should inherit the isolation provided by it. |
We would, but my understanding is that none of that has been done yet. |
W.r.t. the ... |
Two action items from today's discussion on this topic:
|
> One optimization would be to piggyback the submitted job information to the wreck.state event

Looking back through the job module, I see that currently jobs first enter the "reserved" state before transitioning to "submitted". In addition to including `nnodes`, `ntasks` information in the `wreck.state.submitted` event, we could also skip the "reserved" state for submitted jobs (I don't see how it is necessary) and save 1 event per submitted job, if that is not an issue for sched. |
We could save more than that really, there's also a null->null transition event. Sched would be perfectly happy about that, it just falls through a switch statement to implement both currently.
|
Even if sched didn't do this, reducing state transitions within sched should be pretty easy. What we need is just a contract between wreck and sched. |
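A sketch of the "piggybacked" submit event discussed above, so sched could learn a job's size without an extra KVS lookup. The `nnodes`/`ntasks` fields are mentioned in the thread; the topic string is taken from the thread as well, but the overall payload shape and the other field names are assumptions.

```python
# Hypothetical payload for a wreck.state.submitted event that carries the
# job size inline, saving sched a KVS round trip per submitted job.
submitted_event = {
    "topic": "wreck.state.submitted",
    "payload": {
        "jobid": 42,      # illustrative values only
        "nnodes": 1,
        "ntasks": 4,
    },
}
```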
For wreck the "reserved" state just means that the KVS directory is reserved for wreck as writer. In the case of ... I have no idea what the null->null transition signifies. |
libjsc inserts the null->null transition when the "reserved" state event is received from wreck. It calls callbacks once for null->null, and once for null->reserved. @dongahn, what's the rationale here? I'm taking out the null->null transition in my test branch and fixing up the sharness tests to not expect it, but will I break something in sched? |
@SteVwonder and @trws: I will try to do a quick PR for the ...
|
OK, my wreck-experimental branch has the changes discussed here.
|
OK, I will test this soon. Just to make sure I understand, ... |
They are emitted only with ... |
It turned out there is more to it than making .... I will look more closely into this later. |
I suspect that the FCFS scheduler actually requires the reservation capability for the general case. If the first job doesn't find all of the required resources, the scheduler should reserve those partially found resources so that it can move on to the next job and see whether that one can be scheduled. The reason for this out-of-order behavior is that the next job may require only a different type of resources than the first job, and the fact that the first job is not scheduled shouldn't prevent the next one from being scheduled -- that is, as long as it doesn't use the resources the first job will use at the next schedule loop. I think we can still remove the release-reservation step if we assume the FCFS scheduler uses a queue depth of 1.

Scheduler optimization like this falls within the grand scheme of scheduler specialization, so special-casing like this shouldn't be that bad -- that is, as long as we can manage the complexity with config files etc. later on. |
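To spell out the out-of-order effect (pure pseudocode, not the flux-sched implementation; `find`, `allocate`, and `reserve` are assumed helpers): when the head-of-queue job can't be fully satisfied, its partial allocation is reserved and the loop keeps walking, so a later job that fits can start first. With a queue depth of 1, the loop stops after the first job and the reserve/release machinery is never exercised.

```python
def fcfs_with_reservation(pending, resources, queue_depth):
    """Sketch of an FCFS pass that reserves partially found resources."""
    started = []
    for job in pending[:queue_depth]:
        found = resources.find(job.request)
        if found.complete:
            resources.allocate(job, found)   # job starts now
            started.append(job)
        else:
            resources.reserve(job, found)    # hold what was found for it...
    return started                            # ...and keep walking the queue
```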
@dongahn, that's a great point. The ... |
To be pedantic, true FCFS is always depth one; scheduling out of order doesn't really fit the FCFS model. |
Good point. However, I think this was discussed with @lipari, and I think we agreed that this should be the behavior. (I vaguely remember he convinced us that other schedulers implement FCFS this way.) I already have a PR for this; could you review? We can revisit these semantics later if needed, though. |
Honestly, I'm fine with out of order for now. You're certainly right that it doesn't fit the model, but for the current push it kinda doesn't matter. |
Also, true FCFS is, as Stephen mentions, always depth one. It's an odd side effect of the decomposition of sched that FCFS implements a partial backfill at the moment. |
True. We can call depth one "pedantic FCFS" and depth > 1 "optimized FCFS". At the end of the day, queuing policy + scheduler parameters will determine the performance of the scheduler and will serve as our knobs to specialize our scheduling, tailored to the workload. At some point we should name policy plugin + parameter combinations for some of the representative workloads, like "HTC small jobs", although we should still expose the individual knobs to users as well. |
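Something like the following could serve as named presets (illustrative only; the plugin names and knob spellings are assumptions), while each knob stays individually overridable:

```python
# Hypothetical workload presets bundling a policy plugin with its parameters.
SCHED_PRESETS = {
    "htc-small-jobs": {"plugin": "sched.fcfs", "queue-depth": 1},
    "capability":     {"plugin": "sched.backfill", "queue-depth": 1024},
}

def resolve(preset_name, overrides=None):
    """Merge user overrides on top of a named preset."""
    params = dict(SCHED_PRESETS[preset_name])
    params.update(overrides or {})
    return params
```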
Sounds like a good idea to me.
|
I looked at the code. It seems I did it this way to work around some odd synchronization problems between KVS updates and eventing. Maybe I can find the details in an issue ticket... Since I need to tune the logic with respect to @grondo's augmented submit event change, I will see if I can remove this transition. |
@garlick: okay, I found #205 (comment). Essentially, this was to break a race condition. My guess is that now that we are using an event to emit the state, we will be able to live without this transition... |
I'll go ahead and close this, now that we have a project board for splash. @trws, @grondo, @dongahn, and @SteVwonder may want to review the discussion in this issue quickly, to determine if anything was discussed that didn't get peeled off into its own issue. |
This is to help tie together information for the splash app for sierra.
General problem:
Run between 1 and 2 million jobs, each one node and four processes in size, with proper affinity (each process should be near a GPU and have a GPU selected), over the course of a day across four thousand nodes. It should be possible to cancel a job in flight if its priority falls below a certain level; this logic doesn't have to be in flux, but the cancel mechanism needs to be available. To deal with the large number of jobs, we need to be able to handle long queues and fast submission simultaneously, both at startup and over the course of a full run; purging old jobs is an acceptable solution even if it loses job information for completed jobs.
Currently the filed issues are:
high priority:
Found in this effort, non-blocking:
Tagging for notification: @grondo, @garlick, @SteVwonder, @dongahn