Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jsc: support for augmented wreck.state.submitted event #1389

Merged
merged 5 commits into from
Mar 28, 2018
Merged

jsc: support for augmented wreck.state.submitted event #1389

merged 5 commits into from
Mar 28, 2018

Conversation

dongahn
Copy link
Member

@dongahn dongahn commented Mar 27, 2018

This is in preparation for the upcoming flux-sched PR
and requires @grondo's job.submit change that is available
at wreck-experimental.

The new wreck.state.submitted event will be piggybacked with
job request info such as the number of nodes and walltime and
the scheduler will make use of this front-loaded information to
cut down on KVS accesses.

This also removes the null to null job transition code path
which is legacy code to break a race condition way
back when jsc was using KVS watch for job state monitoring.

Adjust jsc test case and README.

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

@dongahn, if you want you can cherry pick required patch from my branch for this PR... I will check it out further tonight.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

@SteVwonder: this has a change to your emulator code. I'll appreciate if you can review this.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Opps, this meant to go to the sched PR. Sorry about this.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

@grondo: I'd be happy to do that. Is this b39df95 commit alone do?

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

@dongahn, I think so. I can check later, right now picking up kids...

@coveralls
Copy link

coveralls commented Mar 27, 2018

Coverage Status

Coverage increased (+0.02%) to 78.845% when pulling a1f992e on dongahn:jsc_new_submit into cdc9e9b on flux-framework:master.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

@grondo: ok, it's picked!

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

Oops, looks from Travis like the cherry-pick maybe had a conflict. I'll pull down your branch and see how best to resolve any conflicts (if that is indeed the problem)

Sorry about that!

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Sorry. I didn't have time to look at the merge. It looks like src/modules/wreck/job.c didn't go well. Since I didn't make any mods, I should just take in the whole content from job.c of that commit?

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

Sorry this was my fault as b39df95 required cec4196 (skip reserved state for submitted job).

I'll make appropriate fixup for you.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Well... I copied the latest from that commit but it seems there are previous commits that I should have cherry-picked as well. At this point, I think it would be better, if @grondo, you can do this... I'm afraid I will lose some commit history.

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

Sorry. I didn't have time to look at the merge. It looks like src/modules/wreck/job.c didn't go well. Since I didn't make any mods, I should just take in the whole content from job.c of that commit?

No, sorry that commit requires the one before (the one that removes the reserved state for submitted jobs). If including that commit will break sched, then I'll reorder them for you.

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

@dongahn, do you know if jsc/sched can handle removal of reserved state before submitted? If so then we can just cherry pick both commits...

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Yes. The latest sched PR handles this. It assumes the first job state of a submit is submitted.

JSC recognizes the reserved stste, but the sched has logic, not to be bothered wih it.

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

Yes. The latest sched PR handles this. It assumes the first job state of a submit is submitted.

Great! Let me try pushing with both commits if I have permission to push to your branch?

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

I made you a collaborator so you should have push access @grondo!

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

I made you a collaborator so you should have push access @grondo!

Thanks, I'll push an update!

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Thanks @grondo for pushing those two commits. When this gets in, I will re-test flux-sched #295

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

FYI, I ran a 10K job soak test (adapted to use flux-submit) on your PR with flux-framework/flux-sched#295 as a sanity check and things seem to run smoothly (no issues at all submitting 10K jobs).

Do we already have another sched benchmark that could measure improvement in job submission and scheduling for future PRs?

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Do we already have another sched benchmark that could measure improvement in job submission and scheduling for future PRs?

I can use PerfExplore and compare the performance between two versions.

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

I can use PerfExplore and compare the performance between two versions.

Not critical I was just wondering if there was something akin to soak.sh for use with flux-sched, or if perhaps it would be useful to adapt soak.sh for a haha test for PRs that might affect scheduler and core performance.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

I also have t1003-stress.t. Maybe that's simpler. I don't think I have purge and things like that set up. Does soak.sh utilize purge?

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

I also have t1003-stress.t. Maybe that's simpler. I don't think I have purge and things like that set up. Does soak.sh utilize purge?

No, I think the point of soak is to see how the rss and sqlite-db grow with many, many jobs so possibly not appropriate for a sched benchmark

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

As far as I'm concerned this PR is ready to go in. However, perhaps someone else should do the merge since 2/5 commits are authored by me.

I did run many thousands of jobs of various sizes through with this PR + your sched PR and no issues.

@dongahn
Copy link
Member Author

dongahn commented Mar 27, 2018

Don't merge this yet. I want to make one more modification to allow our emulator to keep the original tool flow. flux-framework/flux-sched#295 (comment)

@grondo
Copy link
Contributor

grondo commented Mar 27, 2018

Thanks for taking care of that @dongahn!

@codecov-io
Copy link

codecov-io commented Mar 28, 2018

Codecov Report

Merging #1389 into master will increase coverage by 0.02%.
The diff coverage is 37.31%.

@@            Coverage Diff             @@
##           master    #1389      +/-   ##
==========================================
+ Coverage   78.51%   78.53%   +0.02%     
==========================================
  Files         162      162              
  Lines       29778    29801      +23     
==========================================
+ Hits        23379    23404      +25     
+ Misses       6399     6397       -2
Impacted Files Coverage Δ
src/common/libjsc/jstatctl.c 75.03% <19.14%> (-3.4%) ⬇️
src/modules/wreck/job.c 72.82% <80%> (+5.7%) ⬆️
src/common/libflux/keepalive.c 86.66% <0%> (-6.67%) ⬇️
src/common/libflux/content.c 86.66% <0%> (-3.34%) ⬇️
src/broker/modservice.c 79.61% <0%> (-0.98%) ⬇️
src/common/libflux/rpc.c 92.56% <0%> (-0.83%) ⬇️
src/modules/kvs/kvs.c 65.33% <0%> (-0.17%) ⬇️
src/common/libflux/request.c 88.46% <0%> (ø) ⬆️
src/common/libflux/message.c 81.36% <0%> (+0.23%) ⬆️
src/broker/module.c 84.07% <0%> (+0.27%) ⬆️
... and 7 more

@garlick
Copy link
Member

garlick commented Mar 28, 2018

Is this waiting on flux-framework/flux-sched#295 or could it go in now and be fixed up as needed later?

I'm keen to get it in so I can rework #1388 on top of it. Happy to press the button (if I'm available when you're ready...should be for a couple more hours this morning, then tomorrow morning)

grondo and others added 5 commits March 28, 2018 10:00
The 'reserved' state is meant only for a reserved KVS directory
for a job which has not yet been submitted or run (i.e. reserved
for wreck as writer). In the case of jobs submitted via flux-submit
this state is unecessary, so remove the initial reserved state for
submitted jobs, and the corresponding duplicated code that was
a result.
Embed the ntasks,nnodes,walltime members of the job request
in the wreck.state.submitted and wreck.state.reserved events.
This data could be used to save round-trips to the KVS from
the scheduler.
Add support for the new wreck.state.submitted event with which
job request info such as nnodes and walltime is piggybacked.

Schedulers can use this augmented information to reduce
KVS accesses to fetch job request information for performance
optimization.

Elliminate null->null transition code path, a legacy code
to deal with a race condition when JSC was using KVS
watch for monitoring state changes.
@grondo
Copy link
Contributor

grondo commented Mar 28, 2018

I just rebased @dongahn's branch on current master.

I think if we merge this now it will break the current flux-sched, so we'll need to wait until flux-framework/flux-sched#295 is ready so they can go in together. At least I think this is the case, @dongahn or @SteVwonder, please advise if otherwise.

@grondo
Copy link
Contributor

grondo commented Mar 28, 2018

I'm keen to get it in so I can rework #1388 on top of it.

You might try rebasing on @dongahn's branch now, then it will be a kind of noop to rebase on new master once this is merged (sorry if you've already done this). Hopefully the flux-sched PR won't require any more than trivial changes to this PR.

@dongahn
Copy link
Member Author

dongahn commented Mar 28, 2018

I think if we merge this now it will break the current flux-sched, so we'll need to wait until flux-framework/flux-sched#295 is ready so they can go in together. At least I think this is the case, @dongahn or @SteVwonder, please advise if otherwise.

I sort of want both PRs to go in as soon as possible given its needs for Splash. I think the only issue with the current sched PR is on the emulator code which @trws won't use. Maybe we can merge the sched PR as is and fix the emulator problem later. This will also help me to do another PR for the lightweight R. @SteVwonder?

@garlick
Copy link
Member

garlick commented Mar 28, 2018

You might try rebasing on @dongahn's branch now

Will do, thanks. I was wondering how fluid changes would be to that PR but it sounds like it's probably stable.

@grondo
Copy link
Contributor

grondo commented Mar 28, 2018

@dongahn, given @trws needs for splash I also think we should get this in ASAP. If it ok to merge this let's let @garlick push the button.

An alternative would be to branch off flux-core and flux-sched/master with a splash branch where we can make more experimental and gratuitous changes, then merge back to master the salvageable code when splash firedrill is over.

@garlick
Copy link
Member

garlick commented Mar 28, 2018

It doesn't seem wrong to push master forward for this, given that the exec system will be replaced and that will require this sched/exec interface to be overhauled anyway. I'll push the button in a few minutes if there are no immediate objections.

@SteVwonder
Copy link
Member

SteVwonder commented Mar 28, 2018

I am fine with pushing these through and having the emulator temporarily broken. I can look at flux-framework/flux-sched#295 now and see what is going on. Hopefully, I can put together a PR by the end of the day.

@grondo
Copy link
Contributor

grondo commented Mar 28, 2018

Thanks @garlick @SteVwonder !

@garlick garlick merged commit bccf233 into flux-framework:master Mar 28, 2018
@trws
Copy link
Member

trws commented Mar 28, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants