
PR to integrate the initial resource-query work into flux-sched #277

Closed
dongahn opened this issue Oct 30, 2017 · 5 comments

dongahn commented Oct 30, 2017

I spoke with @grondo and @morrone this morning. Given where we are with @grondo's jobshell and @morrone's jobspec work, cleaning up and integrating my resource-query and scheduling infrastructure code into our public flux-sched repo earlier rather than later seems like a good path. I considered completing all of the heavy-lifting items -- scalability/performance tuning for the Boost graph walking and tying up loose ends -- before integrating this code into sched, but from our discussions it seems we gain much more benefit by making a "functionality cut" available through a public repo.


dongahn commented Nov 10, 2017

Just FYI -- I tried to post an experimental PR for the initial resource-query work today (before my week-long travel next week), but I don't think I can achieve that goal. At least, I have cleaned up the code and begun to add test cases. Thanks to @garlick, it turns out that sharness without test_under_flux() works perfectly. For anyone interested in this progress -- I pushed all of my changes to my local repo, and the testing structure looks like this.
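
(A minimal sketch of such a standalone sharness test, assuming standard sharness conventions; the GRUG file, command stream, --grug option name, and fixture paths below are hypothetical illustrations, not the actual tests.)

```sh
#!/bin/sh

test_description='resource-query matches jobspecs against a GRUG-defined resource graph'

# Plain sharness, no test_under_flux(): just source sharness.sh directly.
. $(dirname $0)/sharness.sh

# Hypothetical fixtures and option names for illustration; real names may differ.
grug="${SHARNESS_TEST_SRCDIR}/data/resource/grugs/tiny.graphml"
cmds="${SHARNESS_TEST_SRCDIR}/data/resource/commands/basic.in"
expected="${SHARNESS_TEST_SRCDIR}/data/resource/expected/basic.out"

test_expect_success 'resource-query handles a basic match command stream' '
    resource-query --grug=${grug} < ${cmds} > basic.raw &&
    test_cmp ${expected} basic.raw
'

test_done
```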


dongahn commented Nov 19, 2017

FYI -- I am close, but not quite ready to post a PR for my resource-query work. I've added a bunch of sharness test cases, and as part of that I discovered a nasty bug in planner, which I fixed in PR #281. The resource policy section has also been added to README.md.

One problem I ran into while preparing an experimental PR of this for flux-sched is that the sharness test output differs between OS X and Linux. On my Mac, the output of a test like this contains only the output of resource-query, but on Linux it contains not only the stdout of resource-query but also the command-line interface and its input... Probably a stdout/stderr control issue.
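
One possible workaround (a sketch under the assumption that the extra lines on Linux are the input command lines echoed back verbatim) is to filter the echoed input out of the captured output before comparing:

```sh
# Hypothetical normalization step inside the test: drop any output lines that
# are exact copies of the input command lines, then compare as usual.
grep -v -x -F -f ${cmds} basic.raw > basic.out &&
test_cmp ${expected} basic.out
```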

My parents will be visiting me, so it is unlikely I will be able to make much progress this week, but if @morrone wants to take a look, I believe the code is in reasonable shape.


dongahn commented Nov 21, 2017

  • Added a test case for IO bandwidth aware matching -- much simpler than @SteVwonder's, but some sites may model it as simply as this. (I do plan to fold his logic into this as part of future work.)

  • Added a section on "Fully vs. Partially Specified Resource Request" to README.md (see the sketch after this list).

  • Some of the work items I plan to get to before the PR:

    • max/min count test
    • understanding the effect of the resource vertex copy constructor on its embedded planner object
    • document limitations
    • stress tests with many jobs
    • integration (including solving the sharness output difference issue)

  • More advanced work -- locality-aware matching tests, power-aware scheduling tests, and performance-variability-aware scheduling tests, as well as up-walk logic hardening -- will be done as I conduct some research with @tpatki and @trws.
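
To illustrate the "Fully vs. Partially Specified Resource Request" distinction with a hypothetical example (the jobspec fragments below are illustrative only; the exact schema resource-query accepts is an assumption here, not taken from the README):

```sh
# A partially specified request names only the resources it needs and leaves
# the enclosing levels to the matcher; a fully specified request pins the
# whole containment hierarchy. (Illustrative fragments, not a verified schema.)
cat <<'EOF' > partial.yaml
resources:
  - type: core
    count: 4
EOF

cat <<'EOF' > full.yaml
resources:
  - type: node
    count: 1
    with:
      - type: socket
        count: 1
        with:
          - type: core
            count: 4
EOF
```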


dongahn commented Nov 28, 2017

I believe I have done all but the last two items (stress tests and the actual integration). Depending on my time budget, I may spend a minimal amount of time on the up-walk logic. (I hate doing a PR with code that isn't exercised at all...)


dongahn commented Nov 29, 2017

OK. I ran some many-job tests. I used a GRUG file with 108 compute nodes and ran match allocate_orelse_reserve on 10,000 jobspecs with a known makespan on my Mac OS X machine. The performance came out to about 16 jobs/s, but the schedule is correct! There will be a lot to do to optimize its scalability and performance during the second phase, but I am happy with where this is for this round.
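
For reference, a rough sketch of how such a run can be driven end to end (`match allocate_orelse_reserve` is the command mentioned above; the --grug option name, the quit command, and the GRUG/jobspec file names are assumptions for illustration):

```sh
# Hypothetical throughput driver: emit 10000 match commands, feed them to
# resource-query in one batch, and time the whole run.
for i in $(seq 1 10000); do
    echo "match allocate_orelse_reserve job${i}.yaml"
done > cmds
echo "quit" >> cmds    # "quit" as a session terminator is an assumption

time resource-query --grug=medium-108node.graphml < cmds > out
# throughput (jobs/s) ~= 10000 / elapsed wall-clock seconds
```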

dongahn closed this as completed Mar 20, 2018