Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: fix event payload and memory leak in job.submit-nocreate #1492

Merged
merged 7 commits into from
Apr 29, 2018

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Apr 28, 2018

This fixes #1491 by caching the struct wreck_job created in job.create in the active_jobs hash, so it can be fetched and reused in job.submit-nocreate. The job.submit-nocreate rpc now only requires the jobid member be sent along in the payload.

In order to test the submit and submit-nocreate handlers better, a dummy sched module is built for the testsuite under wreck/sched-dummy.la and is used by a new test t2000-wreck-dummy-sched.t to test that the submitted events for job.submit and job.submit-nocreate are properly formed.

grondo added 3 commits April 27, 2018 19:43
Problem: The event generated as a result of `job.submit-nocreate`
rpc from flux-wreckrun had ncores and ngpus set to 0 since wreckrun
did not forward these values along in the message of the rpc. This
results in confusion for sched, the main use case for the
submit-nocreate service.

Since the wreck/job module now can cache active jobs, there is no
longer a reason to require the caller to forward along all data in
the job.submit-nocreate rpc. Instead, have the job.create rpc
preemptively add the created struct wreck_job to the active_jobs
hash, and have the `job.submit-nocreate` callback fetch the fully
instantiated job by jobid from the hash.

This assumes that job.submit-nocreate will be called on the same
rank as job.create, but for the wreckrun case this is certainly
true.

Fixes flux-framework#1491
The job.submit-nocreate rpc no longer requires any payload members
other than jobid. Remove these extraneous arguments from the call.
Problem: In the job module job_submit_only() function, a
flux_future_t is created from send_create_event() and then
used synchronously with a flux_future_get(), but the future
is not destroyed.

Free the future in both the successful and error callpaths to
avoid leaking data.
@grondo grondo requested a review from garlick April 28, 2018 15:32
@grondo
Copy link
Contributor Author

grondo commented Apr 28, 2018

One caveat with this fix-- job.submit-nocreate must be called on the same rank from which job.create is called. This is always the case with flux-wreckrun, but if other use cases somehow violate this assumption then submit-nocreate may not work. (actually it will suffer from a similar problem that this PR fixed)

@coveralls
Copy link

coveralls commented Apr 28, 2018

Coverage Status

Coverage increased (+0.07%) to 79.089% when pulling a887d1c on grondo:issue#1491 into 33b5871 on flux-framework:master.

@garlick
Copy link
Member

garlick commented Apr 29, 2018

Might be good to pull in b70b76e to this PR, which would allow you to declare:

MOD_NAME ("sched-dummy");
MOD_SERVICE ("sched");

and still answer the job module's ping request.

The module could then be loaded with flux module load sched-dummy rather than having to specify the full path.

Also, perhaps now would be a good time to insert a comment block in the job_submit_only() function describing how it is used by wreckrun. I keep forgetting why we need it.

Sorry about that future leak!

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me as is. I made a couple of minor suggestions, but we could also just merge now and fix later.

@grondo
Copy link
Contributor Author

grondo commented Apr 29, 2018

The module could then be loaded with flux module load sched-dummy rather than having to specify the full path.

Thanks, wouldn't that require prepending to FLUX_MODULE_PATH anyway?

Another approach I considered was to just add an option to job module to allow submits for testing. However, the dummy middle actually seemed easier (following the example of other modules used for testing). I'd be willing to change it though.

Also, perhaps now would be a good time to insert a comment block in the job_submit_only() function describing how it is used by wreckrun. I keep forgetting why we need it.

Very good comment. I'll take care of that!

@garlick
Copy link
Member

garlick commented Apr 29, 2018

wouldn't that require prepending to FLUX_MODULE_PATH anyway?

I assumed not since the other wreck modules in the same directory can be loaded though I didn't check in detail.

The dummy module approach seems totally reasonable to me.

@grondo
Copy link
Contributor Author

grondo commented Apr 29, 2018 via email

@garlick
Copy link
Member

garlick commented Apr 29, 2018

Oh oops, probably not then. Never mind!

grondo added 4 commits April 29, 2018 20:07
Problem: The purpose of the job.submit-nocreate rpc and its handler
job_submit_only are not entirely clear from the code.

Add a descriptive comment block to the top of the function for future
contributor reference as suggested by @garlick.
Enhance the event-trace script with ability to print payload
or execute a snippet of Lua code on each event.
Add a do-nothing dummy sched module for testing wreck components
that use a ping to the "sched" module to determine which job.* interface
to use when submitting or running jobs.
Add a couple tests that use the dummy sched module in order to
test the events generated by job.submit and job.submit-nocreate.

These tests are isolated in their own test file because the wreck
job module caches the `sched_loaded` boolean, so the dummy sched
module can't be unloaded to revert wreck/job to its previous behavior,
which may confuse future tests (e.g. wreckrun would start blocking)

Also, subsequent tests could be confused by the tests in this file
leaving jobs in a submitted state.
@grondo grondo changed the title Fix event payload for job.submit-nocreate wreck: fix event payload and leaks in job.submit-nocreate Apr 29, 2018
@grondo grondo changed the title wreck: fix event payload and leaks in job.submit-nocreate wreck: fix event payload and memory leak in job.submit-nocreate Apr 29, 2018
@grondo
Copy link
Contributor Author

grondo commented Apr 29, 2018

@garlick, I added a comment block as suggested to the top of job_submit_only. (also removed some erroneous cut-and-paste comment in the sched-dummy module)

@codecov-io
Copy link

Codecov Report

Merging #1492 into master will increase coverage by 0.07%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #1492      +/-   ##
==========================================
+ Coverage   78.71%   78.78%   +0.07%     
==========================================
  Files         164      164              
  Lines       30585    30587       +2     
==========================================
+ Hits        24076    24099      +23     
+ Misses       6509     6488      -21
Impacted Files Coverage Δ
src/modules/wreck/job.c 75.47% <85.71%> (+2.57%) ⬆️
src/common/libflux/rpc.c 93.38% <0%> (-0.83%) ⬇️
src/broker/overlay.c 73.81% <0%> (-0.32%) ⬇️
src/cmd/flux-module.c 85.06% <0%> (-0.31%) ⬇️
src/common/libflux/message.c 81.48% <0%> (+0.11%) ⬆️
src/common/libflux/handle.c 83.91% <0%> (+0.24%) ⬆️
src/common/libflux/future.c 88.78% <0%> (+0.46%) ⬆️
src/common/libutil/dirwalk.c 94.28% <0%> (+0.71%) ⬆️
src/bindings/lua/flux-lua.c 82.35% <0%> (+0.77%) ⬆️
... and 2 more

@garlick
Copy link
Member

garlick commented Apr 29, 2018

Looks good! Thanks!

@garlick garlick merged commit ccc0196 into flux-framework:master Apr 29, 2018
@grondo grondo deleted the issue#1491 branch April 29, 2018 21:13
@grondo
Copy link
Contributor Author

grondo commented Apr 29, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wreckrun and sched don't get along
4 participants