
PR to integrate the initial resource-query work into flux-sched #277

Closed
dongahn opened this issue Oct 30, 2017 · 5 comments

dongahn commented Oct 30, 2017

I spoke with @grondo and @morrone this morning. Given where we are with @grondo's jobshell and @morrone's jobspec work, cleaning up and integrating my resource-query and scheduling infrastructure code into our public flux-sched repo earlier rather than later seems like a good path. I considered completing all of the heavy-lifting items -- scalability/performance tuning for the Boost graph walking and tying up loose ends -- before integrating this code into sched, but from our discussions it seems we gain much more benefit by making a "functionality cut" available through a public repo.


dongahn commented Nov 10, 2017

Just FYI -- I tried to post an experimental PR for the initial resource-query work today (before my week-long travel next week), but I don't think I can achieve that goal. At least, I have cleaned up the code and begun to add test cases. Thanks to @garlick, it turns out that sharness without test_under_flux() works perfectly. For anyone interested in this progress -- I pushed all of my changes to my local repo, and the testing structure looks like this.
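
(A minimal sketch of such a standalone sharness test, assuming standard sharness conventions; the GRUG file, command stream, --grug option name, and fixture paths below are hypothetical illustrations, not the actual tests.)

```sh
#!/bin/sh

test_description='resource-query matches jobspecs against a GRUG-defined resource graph'

# Plain sharness, no test_under_flux(): just source sharness.sh directly.
. $(dirname $0)/sharness.sh

# Hypothetical fixtures and option names for illustration; real names may differ.
grug="${SHARNESS_TEST_SRCDIR}/data/resource/grugs/tiny.graphml"
cmds="${SHARNESS_TEST_SRCDIR}/data/resource/commands/basic.in"
expected="${SHARNESS_TEST_SRCDIR}/data/resource/expected/basic.out"

test_expect_success 'resource-query handles a basic match command stream' '
    resource-query --grug=${grug} < ${cmds} > basic.raw &&
    test_cmp ${expected} basic.raw
'

test_done
```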


dongahn commented Nov 19, 2017

FYI -- I am close, but not quite ready to post a PR for my resource-query work. I've added a bunch of sharness test cases, and as part of that I discovered a nasty bug in planner, which I fixed in PR #281. The resource policy section has also been added to README.md.

One problem I ran into while preparing an experimental PR of this for flux-sched is that the sharness test output differs between OS X and Linux. On my Mac, the output of a test like this contains only the output of resource-query, but on Linux it contains not only the stdout of resource-query but also the command-line interface and its input... Probably a stdout/stderr control issue.
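
One possible workaround (a sketch under the assumption that the extra lines on Linux are the input command lines echoed back verbatim) is to filter the echoed input out of the captured output before comparing:

```sh
# Hypothetical normalization step inside the test: drop any output lines that
# are exact copies of the input command lines, then compare as usual.
grep -v -x -F -f ${cmds} basic.raw > basic.out &&
test_cmp ${expected} basic.out
```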

My parents will be visiting me, so it is unlikely I will be able to make much progress this week, but if @morrone wants to take a look, I believe the code is in reasonable shape.


dongahn commented Nov 21, 2017

  • Added a test case for IO bandwidth aware matching -- much simpler than @SteVwonder's, but some sites may model it as simply as this. (I do plan to fold his logic into this as part of future work.)

  • Added a section on "Fully vs. Partially Specified Resource Request" to README.md (see the sketch after this list).

  • Some of the work items I plan to get to before the PR:

    • max/min count test
    • understanding the effect of the resource vertex copy constructor on its embedded planner object
    • document limitations
    • stress tests with many jobs
    • integration (including solving the sharness output difference issue)

  • More advanced work -- locality-aware matching tests, power-aware scheduling tests, and performance-variability-aware scheduling tests, as well as up-walk logic hardening -- will be done as I conduct some research with @tpatki and @trws.
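
To illustrate the "Fully vs. Partially Specified Resource Request" distinction with a hypothetical example (the jobspec fragments below are illustrative only; the exact schema resource-query accepts is an assumption here, not taken from the README):

```sh
# A partially specified request names only the resources it needs and leaves
# the enclosing levels to the matcher; a fully specified request pins the
# whole containment hierarchy. (Illustrative fragments, not a verified schema.)
cat <<'EOF' > partial.yaml
resources:
  - type: core
    count: 4
EOF

cat <<'EOF' > full.yaml
resources:
  - type: node
    count: 1
    with:
      - type: socket
        count: 1
        with:
          - type: core
            count: 4
EOF
```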


dongahn commented Nov 28, 2017

I believe I have done all but the last two items (stress tests and the actual integration). Depending on my time budget, I may spend a minimal amount of time on the up-walk logic. (I hate doing a PR with code that isn't exercised at all...)


dongahn commented Nov 29, 2017

OK. I ran some many-job tests. I used a GRUG file with 108 compute nodes and ran match allocate_orelse_reserve on 10,000 jobspecs with a known makespan on my Mac OS X machine. The performance came out to about 16 jobs/s, but the schedule is correct! There will be a lot to do to optimize its scalability and performance during the second phase, but I am happy with where this is for this round.
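
For reference, a rough sketch of how such a run can be driven end to end (`match allocate_orelse_reserve` is the command mentioned above; the --grug option name, the quit command, and the GRUG/jobspec file names are assumptions for illustration):

```sh
# Hypothetical throughput driver: emit 10000 match commands, feed them to
# resource-query in one batch, and time the whole run.
for i in $(seq 1 10000); do
    echo "match allocate_orelse_reserve job${i}.yaml"
done > cmds
echo "quit" >> cmds    # "quit" as a session terminator is an assumption

time resource-query --grug=medium-108node.graphml < cmds > out
# throughput (jobs/s) ~= 10000 / elapsed wall-clock seconds
```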

dongahn closed this as completed Mar 20, 2018