jsc: support for augmented wreck.state.submitted event #1389
@dongahn, if you want you can cherry pick required patch from my branch for this PR... I will check it out further tonight. |
@SteVwonder: this has a change to your emulator code. I'll appreciate if you can review this. |
Oops, this was meant to go to the sched PR. Sorry about this. |
@dongahn, I think so. I can check later, right now picking up kids... |
@grondo: ok, it's picked! |
Oops, it looks from Travis like the cherry-pick may have had a conflict. I'll pull down your branch and see how best to resolve any conflicts (if that is indeed the problem). Sorry about that! |
Sorry. I didn't have time to look at the merge. It looks like |
Well... I copied the latest from that commit but it seems there are previous commits that I should have cherry-picked as well. At this point, I think it would be better, if @grondo, you can do this... I'm afraid I will lose some commit history. |
No, sorry that commit requires the one before (the one that removes the |
@dongahn, do you know if jsc/sched can handle removal of |
Yes. The latest sched PR handles this. It assumes the first job state of a submit is JSC recognizes the |
Great! Let me try pushing with both commits if I have permission to push to your branch? |
I made you a collaborator so you should have push access @grondo! |
Thanks, I'll push an update! |
FYI, I ran a 10K job soak test (adapted to use flux-submit) on your PR with flux-framework/flux-sched#295 as a sanity check and things seem to run smoothly (no issues at all submitting 10K jobs). Do we already have another sched benchmark that could measure improvement in job submission and scheduling for future PRs? |
I can use PerfExplore and compare the performance between two versions. |
Not critical I was just wondering if there was something akin to |
I also have |
No, I think the point of soak is to see how the rss and sqlite-db grow with many, many jobs so possibly not appropriate for a sched benchmark |
As far as I'm concerned this PR is ready to go in. However, perhaps someone else should do the merge since 2/5 commits are authored by me. I did run many thousands of jobs of various sizes through with this PR + your sched PR and no issues. |
Don't merge this yet. I want to make one more modification to allow our emulator to keep the original tool flow. flux-framework/flux-sched#295 (comment) |
Thanks for taking care of that @dongahn! |
Codecov Report
@@ Coverage Diff @@
## master #1389 +/- ##
==========================================
+ Coverage 78.51% 78.53% +0.02%
==========================================
Files 162 162
Lines 29778 29801 +23
==========================================
+ Hits 23379 23404 +25
+ Misses 6399 6397 -2
|
Is this waiting on flux-framework/flux-sched#295 or could it go in now and be fixed up as needed later? I'm keen to get it in so I can rework #1388 on top of it. Happy to press the button (if I'm available when you're ready...should be for a couple more hours this morning, then tomorrow morning) |
The 'reserved' state is meant only for a reserved KVS directory for a job which has not yet been submitted or run (i.e. reserved for wreck as writer). In the case of jobs submitted via flux-submit this state is unnecessary, so remove the initial reserved state for submitted jobs, along with the duplicated code that resulted from it.
Embed the ntasks,nnodes,walltime members of the job request in the wreck.state.submitted and wreck.state.reserved events. This data could be used to save round-trips to the KVS from the scheduler.
Add support for the new wreck.state.submitted event, which piggybacks job request info such as nnodes and walltime. Schedulers can use this augmented information to reduce the KVS accesses needed to fetch job request information, as a performance optimization. Eliminate the null->null transition code path, legacy code that dealt with a race condition from when JSC used KVS watch to monitor state changes.
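A minimal sketch of how a scheduler might consume the augmented event, written in Python for brevity. The payload layout (a JSON object with `ntasks`, `nnodes`, and `walltime` keys) and the `kvs_lookup` helper are assumptions for illustration only; they are not the actual flux-core or flux-sched API.

```python
import json

# Job request members named in the commit message above.
REQUEST_KEYS = ("ntasks", "nnodes", "walltime")


def job_request_from_event(payload, kvs_lookup):
    """Build a job request from one wreck.state.submitted event.

    payload    -- JSON string carried by the event (layout is assumed)
    kvs_lookup -- hypothetical callable standing in for a KVS read
    Returns (request, kvs_reads), where kvs_reads counts fallback reads.
    """
    fields = json.loads(payload) if payload else {}
    request = {}
    kvs_reads = 0
    for key in REQUEST_KEYS:
        if key in fields:
            # Value was piggybacked on the event: no KVS round-trip.
            request[key] = fields[key]
        else:
            # Value is missing from the event: fall back to the KVS.
            request[key] = kvs_lookup(key)
            kvs_reads += 1
    return request, kvs_reads
```

With a fully populated payload the sketch performs no fallback reads, which is the saving the commit message describes; an empty payload degrades to one read per field.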
I just rebased @dongahn's branch on current master. I think if we merge this now it will break the current flux-sched, so we'll need to wait until flux-framework/flux-sched#295 is ready so they can go in together. At least I think this is the case, @dongahn or @SteVwonder, please advise if otherwise. |
You might try rebasing on @dongahn's branch now, then it will be a kind of noop to rebase on new master once this is merged (sorry if you've already done this). Hopefully the flux-sched PR won't require any more than trivial changes to this PR. |
I sort of want both PRs to go in as soon as possible given its needs for Splash. I think the only issue with the current sched PR is on the emulator code which @trws won't use. Maybe we can merge the sched PR as is and fix the emulator problem later. This will also help me to do another PR for the lightweight R. @SteVwonder? |
Will do, thanks. I was wondering how fluid changes would be to that PR but it sounds like it's probably stable. |
@dongahn, given @trws's needs for splash I also think we should get this in ASAP. If it is ok to merge this let's let @garlick push the button. An alternative would be to branch off flux-core and flux-sched/master with a |
It doesn't seem wrong to push master forward for this, given that the exec system will be replaced and that will require this sched/exec interface to be overhauled anyway. I'll push the button in a few minutes if there are no immediate objections. |
I am fine with pushing these through and having the emulator temporarily broken. I can look at flux-framework/flux-sched#295 now and see what is going on. Hopefully, I can put together a PR by the end of the day. |
Thanks @garlick @SteVwonder ! |
As a note, this branch exists, and is now a PR where things are welcome
to go if that’s helpful.
…On 28 Mar 2018, at 11:01, Mark Grondona wrote:
@dongahn, given @trws needs for splash I also think we should get this
in ASAP. If it ok to merge this let's let @garlick push the button.
An alternative would be to branch off flux-core and flux-sched/master
with a `splash` branch where we can make more experimental and
gratuitous changes, then merge back to master the salvageable code
when splash firedrill is over.
|
This is in preparation for the upcoming flux-sched PR and requires @grondo's job.submit change that is available at wreck-experimental.
The new wreck.state.submitted event will be piggybacked with
job request info such as the number of nodes and walltime and
the scheduler will make use of this front-loaded information to
cut down on KVS accesses.
This also removes the null to null job transition code path
which is legacy code to break a race condition way
back when jsc was using KVS watch for job state monitoring.
Adjust jsc test case and README.
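As a rough illustration of why an event-driven monitor needs no null to null path, here is a minimal Python sketch. The `JobStateTracker` class and the state names used with it are hypothetical and do not reflect the actual jsc implementation; the assumption is that every event carries a real new state, so the first transition seen for a job always starts from "null" and ends somewhere else.

```python
class JobStateTracker:
    """Derive (old, new) state transitions from state events alone."""

    def __init__(self):
        # Maps job id to the last state seen for that job.
        self._last = {}

    def on_event(self, jobid, new_state):
        # A job with no recorded state is treated as coming from "null".
        old_state = self._last.get(jobid, "null")
        self._last[jobid] = new_state
        return old_state, new_state
```

For example, feeding a "submitted" event for a new job yields the transition ("null", "submitted"), and a later event for the same job reports "submitted" as its old state.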