wreck: fix event payload and memory leak in job.submit-nocreate #1492
Problem: The event generated as a result of a `job.submit-nocreate` rpc from flux-wreckrun had ncores and ngpus set to 0, since wreckrun did not forward these values along in the message of the rpc. This results in confusion for sched, the main use case for the submit-nocreate service.

Since the wreck/job module can now cache active jobs, there is no longer a reason to require the caller to forward along all data in the `job.submit-nocreate` rpc. Instead, have the job.create rpc preemptively add the created struct wreck_job to the active_jobs hash, and have the `job.submit-nocreate` callback fetch the fully instantiated job by jobid from the hash.

This assumes that `job.submit-nocreate` will be called on the same rank as job.create, but for the wreckrun case this is certainly true.

Fixes flux-framework#1491
The job.submit-nocreate rpc no longer requires any payload members other than jobid. Remove these extraneous arguments from the call.
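The caching scheme described in the two commit messages above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the module's code: `struct job`, `struct job_cache`, and the `job_cache_*` functions are hypothetical stand-ins for `struct wreck_job` and the `active_jobs` hash, and the fixed-size chained table is an assumption made only to keep the example free of dependencies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct wreck_job (assumption, not flux-core). */
struct job {
    int64_t id;
    int ncores;
    int ngpus;
    struct job *next;               /* next entry in the same bucket */
};

#define NBUCKETS 64

/* Hypothetical stand-in for the active_jobs hash. */
struct job_cache {
    struct job *buckets[NBUCKETS];
};

static unsigned bucket_of (int64_t id)
{
    return (unsigned)((uint64_t)id % NBUCKETS);
}

/* "create" path: instantiate the job and cache it immediately. */
struct job *job_cache_create (struct job_cache *c, int64_t id,
                              int ncores, int ngpus)
{
    assert (c != NULL);
    struct job *j = calloc (1, sizeof (*j));
    if (!j)
        return NULL;
    j->id = id;
    j->ncores = ncores;
    j->ngpus = ngpus;
    unsigned b = bucket_of (id);
    j->next = c->buckets[b];
    c->buckets[b] = j;
    return j;
}

/* "submit-nocreate" path: the caller supplies only the id, and the
 * fully populated job is fetched from the cache. */
struct job *job_cache_lookup (struct job_cache *c, int64_t id)
{
    assert (c != NULL);
    for (struct job *j = c->buckets[bucket_of (id)]; j; j = j->next) {
        if (j->id == id)
            return j;
    }
    return NULL;
}

/* Free every cached job and reset the table. */
void job_cache_clear (struct job_cache *c)
{
    assert (c != NULL);
    for (int i = 0; i < NBUCKETS; i++) {
        struct job *j = c->buckets[i];
        while (j) {
            struct job *next = j->next;
            free (j);
            j = next;
        }
        c->buckets[i] = NULL;
    }
}
```

Because the lookup happens in a process-local table, the sketch carries the same limitation the commit message notes: the lookup only succeeds on the rank where the job was created.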
Problem: In the job module job_submit_only() function, a flux_future_t is created from send_create_event() and then used synchronously with a flux_future_get(), but the future is not destroyed. Free the future in both the successful and error callpaths to avoid leaking data.
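The cleanup pattern described above, destroying the future on both the success and error paths, might look like the sketch below. To keep it self-contained, `struct future` and the `future_*` functions are hypothetical mocks rather than the flux-core API, and `submit_only()` is an illustrative stand-in for `job_submit_only()`; a live-object counter is included only so the absence of a leak can be checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mock of a future; NOT the flux-core flux_future_t. */
struct future {
    int rc;                         /* simulated result of the request */
};

static int live_futures = 0;        /* number of futures not yet destroyed */

struct future *future_create (int rc)
{
    struct future *f = malloc (sizeof (*f));
    if (!f)
        return NULL;
    f->rc = rc;
    live_futures++;
    return f;
}

/* Synchronous "get": returns the simulated result. */
int future_get (struct future *f)
{
    assert (f != NULL);
    return f->rc;
}

void future_destroy (struct future *f)
{
    if (f) {
        free (f);
        live_futures--;
    }
}

int future_live_count (void)
{
    return live_futures;
}

/* Illustrative stand-in for job_submit_only(): a single exit label
 * guarantees the future is destroyed whether the get succeeds or fails. */
int submit_only (int simulated_rc)
{
    int rv = -1;
    struct future *f = future_create (simulated_rc);
    if (!f)
        return -1;
    if (future_get (f) < 0)
        goto done;                  /* error path still reaches destroy */
    rv = 0;
done:
    future_destroy (f);
    return rv;
}
```

Funneling both paths through one `done:` label is a common C idiom for this kind of fix, since it leaves only one place where the destroy call can be forgotten.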
One caveat with this fix-- |
Might be good to pull in b70b76e to this PR, which would allow you to declare:
and still answer the job module's ping request. The module could then be loaded with Also, perhaps now would be a good time to insert a comment block in the Sorry about that future leak! |
This looks good to me as is. I made a couple of minor suggestions, but we could also just merge now and fix later.
Thanks, wouldn't that require prepending to FLUX_MODULE_PATH anyway? Another approach I considered was to just add an option to the job module to allow submits for testing. However, the dummy module actually seemed easier (following the example of other modules used for testing). I'd be willing to change it though.
Very good comment. I'll take care of that! |
I assumed not since the other wreck modules in the same directory can be loaded though I didn't check in detail. The dummy module approach seems totally reasonable to me. |
> wouldn't that require prepending to FLUX_MODULE_PATH anyway?

> I assumed not since the other wreck modules in the same directory can be
> loaded though I didn't check in detail.

That silly sched-dummy module is under `t/wreck/`, not a standard module
path from the build tree I don't think. I'll try it and see if it works
though.
Oh oops, probably not then. Never mind! |
Problem: The purpose of the job.submit-nocreate rpc and its handler job_submit_only is not entirely clear from the code. Add a descriptive comment block to the top of the function for future contributor reference as suggested by @garlick.
Enhance the event-trace script with the ability to print the payload or execute a snippet of Lua code on each event.
Add a do-nothing dummy sched module for testing wreck components that use a ping to the "sched" module to determine which job.* interface to use when submitting or running jobs.
Add a couple of tests that use the dummy sched module in order to test the events generated by job.submit and job.submit-nocreate. These tests are isolated in their own test file because the wreck job module caches the `sched_loaded` boolean, so the dummy sched module can't be unloaded to revert wreck/job to its previous behavior, which may confuse future tests (e.g. wreckrun would start blocking). Also, subsequent tests could be confused by the tests in this file leaving jobs in a submitted state.
@garlick, I added a comment block as suggested to the top of job_submit_only. (also removed some erroneous cut-and-paste comment in the sched-dummy module) |
Codecov Report
@@            Coverage Diff             @@
##           master    #1492      +/-   ##
==========================================
+ Coverage   78.71%   78.78%   +0.07%
==========================================
  Files         164      164
  Lines       30585    30587       +2
==========================================
+ Hits        24076    24099      +23
+ Misses       6509     6488      -21
Looks good! Thanks! |
Thanks! |
This fixes #1491 by caching the `struct wreck_job` created in `job.create` in the `active_jobs` hash, so it can be fetched and reused in `job.submit-nocreate`. The `job.submit-nocreate` rpc now only requires the `jobid` member be sent along in the payload.

In order to test the `submit` and `submit-nocreate` handlers better, a dummy sched module is built for the testsuite under `wreck/sched-dummy.la` and is used by a new test `t2000-wreck-dummy-sched.t` to test that the `submitted` events for `job.submit` and `job.submit-nocreate` are properly formed.