Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sched: Add completing state support #341

Merged
merged 7 commits into from
May 11, 2018

Conversation

dongahn
Copy link
Member

@dongahn dongahn commented May 10, 2018

PASS: t2000-fcfs.t 1 - sim: started successfully
FAIL: t2000-fcfs.t 2 - sim: scheduled and ran all jobs
FAIL: t2000-fcfs.t 3 - jobs scheduled in correct order
PASS: t2000-fcfs.t 4 - sim: unloaded
ERROR: t2000-fcfs.t - exited with status 1
PASS: t2001-fcfs-aware.t 1 - sim: started successfully
FAIL: t2001-fcfs-aware.t 2 - sim: scheduled and ran all jobs
FAIL: t2001-fcfs-aware.t 3 - jobs scheduled in correct order
PASS: t2001-fcfs-aware.t 4 - sim: unloaded
ERROR: t2001-fcfs-aware.t - exited with status 1
PASS: t2002-easy.t 1 - sim: started successfully
FAIL: t2002-easy.t 2 - sim: scheduled and ran all jobs
FAIL: t2002-easy.t 3 - jobs scheduled in correct order
PASS: t2002-easy.t 4 - sim: unloaded
ERROR: t2002-easy.t - exited with status 1
PASS: t2003-fcfs-inorder.t 1 - sim: started successfully with queue-depth=1
FAIL: t2003-fcfs-inorder.t 2 - sim: scheduled and ran all jobs with queue-depth=1
FAIL: t2003-fcfs-inorder.t 3 - jobs scheduled in correct order with queue-depth=1
PASS: t2003-fcfs-inorder.t 4 - sim: unloaded
ERROR: t2003-fcfs-inorder.t - exited with status 1

Emulator-based tests are failing with this patch. I will try to get to it this afternoon.

@dongahn dongahn mentioned this pull request May 10, 2018
@dongahn dongahn force-pushed the adjust_completing branch 4 times, most recently from eec08e2 to f7aa24e Compare May 10, 2018 21:07
@dongahn
Copy link
Member Author

dongahn commented May 10, 2018

OK. I fixed the issue with the emulator. @grondo: If this works for you, this should go into ASAP, since the current sched master will be broken against the flux-core master.

@dongahn dongahn changed the title [WIP] sched: Add completing state support sched: Add completing state support May 10, 2018
@grondo
Copy link
Contributor

grondo commented May 10, 2018

Looks fine to me. I can merge once travis reports.

@coveralls
Copy link

coveralls commented May 10, 2018

Coverage Status

Coverage increased (+0.05%) to 74.8% when pulling 9c68583 on dongahn:adjust_completing into 6aa70d5 on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented May 10, 2018

heh, you gotta love the precision on the coveralls coverage report.

Looks like valgrind caught a new leak (maybe)? This was only on one builder so far though.. hm.

==84273== HEAP SUMMARY:
==84273==     in use at exit: 118,054 bytes in 151 blocks
==84273==   total heap usage: 1,091,084 allocs, 1,090,933 frees, 93,023,209 bytes allocated
==84273== 
==84273== 648 (96 direct, 552 indirect) bytes in 1 blocks are definitely lost in loss record 53 of 63
==84273==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==84273==    by 0x7A7667D: xzmalloc (xzmalloc.c:39)
==84273==    by 0x7A6CFED: q_enqueue_into_pqueue (sched.c:519)
==84273==    by 0x7A6CFED: job_status_cb (sched.c:2097)
==84273==    by 0x4E587FF: invoke_cbs (jstatctl.c:980)
==84273==    by 0x4E587FF: job_state_cb (jstatctl.c:1123)
==84273==    by 0x4E48C33: call_handler (msg_handler.c:302)
==84273==    by 0x4E48DAE: dispatch_message (msg_handler.c:326)
==84273==    by 0x4E48DAE: handle_cb (msg_handler.c:392)
==84273==    by 0x4E6EF24: ev_invoke_pending (ev.c:3314)
==84273==    by 0x4E720E6: ev_run (ev.c:3717)
==84273==    by 0x4E480D2: flux_reactor_run (reactor.c:128)
==84273==    by 0x7A6E83E: mod_main (sched.c:2228)
==84273==    by 0x40A8F8: module_thread (module.c:158)
==84273==    by 0x59D2183: start_thread (pthread_create.c:312)
==84273==    by 0x5EEA03C: clone (clone.S:111)
==84273== 

@dongahn
Copy link
Member Author

dongahn commented May 10, 2018

What test case was it though?

@grondo
Copy link
Contributor

grondo commented May 10, 2018

I assume t5000-valgrind.t

@grondo
Copy link
Contributor

grondo commented May 10, 2018

Hm, 2 leaks in the clang-3.8 builder -- I wonder if these are somehow intermittent. I'll restart the builders and see if they reproduce.

nd ./flux-sched-0.4.0-168-gc1382e0/_build/sub/t/t5000-valgrind.output
expecting success: 
	echo "found broker at ${BROKER}" >&2 &&
	test -x "$BROKER"
found broker at /home/travis/local2/libexec/flux/cmd/flux-broker
ok 1 - found executable flux-broker
found executable flux-broker
expecting success: 
	flux ${VALGRIND} \
		--tool=memcheck \
		--leak-check=full \
		--gen-suppressions=all \
		--trace-children=no \
		--child-silent-after-fork=yes \
		--num-callers=30 \
		--leak-resolution=med \
		--error-exitcode=1 \
		--suppressions=$VALGRIND_SUPPRESSIONS \
		${BROKER} --shutdown-grace=4 ${VALGRIND_WORKLOAD} 10
==83350== Memcheck, a memory error detector
==83350== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==83350== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==83350== Command: /home/travis/local2/libexec/flux/cmd/flux-broker --shutdown-grace=4 /home/travis/build/flux-framework/flux-sched/flux-sched-0.4.0-168-gc1382e0/t/valgrind/valgrind-workload.sh 10
==83350== 
FLUX_URI=local:///tmp/flux-8BBRWj
==83350== 
==83350== HEAP SUMMARY:
==83350==     in use at exit: 116,958 bytes in 150 blocks
==83350==   total heap usage: 1,089,497 allocs, 1,089,347 frees, 89,590,737 bytes allocated
==83350== 
==83350== 152 (96 direct, 56 indirect) bytes in 1 blocks are definitely lost in loss record 45 of 63
==83350==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==83350==    by 0x7C6F25B: xzmalloc (xzmalloc.c:39)
==83350==    by 0x7C664B4: q_enqueue_into_pqueue (sched.c:519)
==83350==    by 0x7C664B4: job_status_cb (sched.c:2097)
==83350==    by 0x4E5922B: invoke_cbs (jstatctl.c:980)
==83350==    by 0x4E5922B: job_state_cb (jstatctl.c:1123)
==83350==    by 0x4E47F17: call_handler (msg_handler.c:302)
==83350==    by 0x4E47DD6: dispatch_message (msg_handler.c:326)
==83350==    by 0x4E47DD6: handle_cb (msg_handler.c:392)
==83350==    by 0x4E76EE8: ev_invoke_pending (ev.c:3314)
==83350==    by 0x4E77C63: ev_run (ev.c:3717)
==83350==    by 0x4E46372: flux_reactor_run (reactor.c:128)
==83350==    by 0x7C65659: mod_main (sched.c:2228)
==83350==    by 0x40A8CD: module_thread (module.c:158)
==83350==    by 0x5E37183: start_thread (pthread_create.c:312)
==83350==    by 0x685803C: clone (clone.S:111)
==83350== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:xzmalloc
   fun:q_enqueue_into_pqueue
   fun:job_status_cb
   fun:invoke_cbs
   fun:job_state_cb
   fun:call_handler
   fun:dispatch_message
   fun:handle_cb
   fun:ev_invoke_pending
   fun:ev_run
   fun:flux_reactor_run
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==83350== 496 (48 direct, 448 indirect) bytes in 1 blocks are definitely lost in loss record 51 of 63
==83350==    at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==83350==    by 0x550652E: safe_malloc (czmq_prelude.h:536)
==83350==    by 0x550652E: zlist_new (zlist.c:57)
==83350==    by 0x909D6D7: resrc_tree_list_new (resrc_tree.c:386)
==83350==    by 0x909D6D7: resrc_tree_new (resrc_tree.c:108)
==83350==    by 0x909AC4B: select_resources (sched_fcfs.c:232)
==83350==    by 0x909AC83: select_resources (sched_fcfs.c:236)
==83350==    by 0x909AB83: select_child (sched_fcfs.c:125)
==83350==    by 0x909AB83: select_children (sched_fcfs.c:155)
==83350==    by 0x909AB83: select_resources (sched_fcfs.c:202)
==83350==    by 0x909AC83: select_resources (sched_fcfs.c:236)
==83350==    by 0x7C63E3B: schedule_job (sched.c:1555)
==83350==    by 0x7C65B2E: schedule_jobs (sched.c:1631)
==83350==    by 0x4E76EE8: ev_invoke_pending (ev.c:3314)
==83350==    by 0x4E77C63: ev_run (ev.c:3717)
==83350==    by 0x4E46372: flux_reactor_run (reactor.c:128)
==83350==    by 0x7C65659: mod_main (sched.c:2228)
==83350==    by 0x40A8CD: module_thread (module.c:158)
==83350==    by 0x5E37183: start_thread (pthread_create.c:312)
==83350==    by 0x685803C: clone (clone.S:111)
==83350== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:calloc
   fun:safe_malloc
   fun:zlist_new
   fun:resrc_tree_list_new
   fun:resrc_tree_new
   fun:select_resources
   fun:select_resources
   fun:select_child
   fun:select_children
   fun:select_resources
   fun:select_resources
   fun:schedule_job
   fun:schedule_jobs
   fun:ev_invoke_pending
   fun:ev_run
   fun:flux_reactor_run
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==83350== LEAK SUMMARY:
==83350==    definitely lost: 144 bytes in 2 blocks
==83350==    indirectly lost: 504 bytes in 19 blocks
==83350==      possibly lost: 0 bytes in 0 blocks
==83350==    still reachable: 115,798 bytes in 128 blocks
==83350==         suppressed: 512 bytes in 1 blocks
==83350== Reachable blocks (those to which a pointer was found) are not shown.
==83350== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==83350== 
==83350== For counts of detected and suppressed errors, rerun with: -v
==83350== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 3 from 2)
not ok 2 - valgrind reports no new errors on single broker run
# failed 1 among 2 test(s)
1..2

@grondo
Copy link
Contributor

grondo commented May 10, 2018

I reran both builders and both failed again with leaks in the t5000-valgrind.t test

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

@grondo: A question.J_RUNNING -> J_CANCELLED is not a valid transition, correct?

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

The leak pattern is such that those two lines 1815 1819 are not executed for some jobs.

Since this is the code newly restructrued, I will look into this a bit more closely.

@grondo
Copy link
Contributor

grondo commented May 11, 2018 via email

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

I believe this is just due to a race condition in valgrind-workload.sh between sched releasing memory vs. the script doing flux module remove sched

@grondo: when does flux wreckrun return? Is there an option where wreckrun returns when sched's completely done w/ its job?

@grondo
Copy link
Contributor

grondo commented May 11, 2018

By default flux-wreckrun will return when the job is in completing complete or failed state and all io streams have been closed.

Two solutions come to mind:

  • In the sched module, when a shutdown request is received, wait for all jobs to finish before exiting the module
  • In valgrind-workload.sh use the kvs-watch-until.lua script to wait until the last job reaches a known safe state

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Thanks @grondo. I have a slight reservation for 1) because it can lead to side effects like the scheduler not yielding when there are long running jobs.

The 2) seems more attractive and I can consider a state like deallocated. But this can overwrite provenance data like failed, cancelled etc.

Perhaps we should add another state field like sched_state for the scheduler's own state transition...

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Thanks @grondo. I have a slight reservation for 1) because it can lead to side effects like the scheduler not yielding when there are long running jobs.

I'm sorry, I don't know what you mean by the scheduler not yielding for long-running jobs. I just meant that the scheduler could check for jobs in the exit path, and if there are still active jobs it could either block until they are done, or kill them, or even free data structures associated with the jobs with an error message that some job data may have been lost.

The 2) seems more attractive and I can consider a state like deallocated. But this can overwrite provenance data like failed, cancelled etc.

Hm, I didn't realize there was some internal sched state we were waiting on, sorry. I thought that we were racing with the last flux-wreckrun exiting on completing and the module remove of sched happening before the last job reached completed.

This may actually be more than just a fix for a corner case -- we don't want initial programs with a series of jobs to exit before the jobs are fully complete!

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Hm, I didn't realize there was some internal sched state we were waiting on, sorry. I thought that we were racing with the last flux-wreckrun exiting on completing and the module remove of sched happening before the last job reached completed.

Yeah, when the scheduler receives complete, it releases memory associated with the job. So, the small window of time (completing event to the end of this memory release) and module remove sched are racing.

We can introduce dealloc state and make the tester wait for the last job to reach this scheduler state. My only concern is, if the job state we are overwriting is not complete (e.g., failed) we lose valuable provenance data.

We can simply copy the old state into another field (e.g., state_before_dealloc) though.

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Yeah, when the scheduler receives complete, it releases memory associated with the job. So, the small window of time (completing event to the end of this memory release) and module remove sched are racing.

Did this race also exist before introduction of the the completing state, the window was just much smaller? In that case, maybe just waiting for the "complete" state from the final job would be enough to close the race to previous level and workaround the valgrind test failure for now?

The drawback to overwriting the job's state is that tools like flux wreck ls will have to be rewritten (and possibly some tests as well) to read both state and state_after_deallocated from the kvs. If you were going to do this, it would be better to have a per-job sched. directory for each job and perhaps a sched.deallocated key could be set after sched has completely freed memory for each job.

However, it occurs to me that a per-job key will be cumbersome for the general case of waiting for an entire workload to be complete. For a complex mix of jobs, there wouldn't be a guarantee which job would be executed last, so a script would have to check for sched.deallocated key for all jobs. What we really want is a command that will wait until sched has finished processing all jobs and freed all associated memory globally. If we had that, it could be (optionally) enabled in the sched module shutdown path.

BTW, if you are worried about sched blocking on a long running or stalled job, this could be mitigated in an rc3 script by cancelling all running jobs before the sched module is removed.

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Did this race also exist before introduction of the the completing state, the window was just much smaller? In that case, maybe just waiting for the "complete" state from the final job would be enough to close the race to previous level and workaround the valgrind test failure for now?

We probably want to take this route for now.

However, it occurs to me that a per-job key will be cumbersome for the general case of waiting for an entire workload to be complete. For a complex mix of jobs, there wouldn't be a guarantee which job would be executed last, so a script would have to check for sched.deallocated key for all jobs. What we really want is a command that will wait until sched has finished processing all jobs and freed all associated memory globally. If we had that, it could be (optionally) enabled in the sched module shutdown path.

We might need both for testing and production use cases.

BTW sched can easily emit a job deallocated event which wreckrun can subscribe to and exit on this event if an option is given. And the workload-wise event can also be emitted if a job is deallocated and there is no job currently sitting in either the pending queue or the running queue.

@grondo
Copy link
Contributor

grondo commented May 11, 2018

We probably want to take this route for now.

Ok, we can copy kvs-watch-until.lua helper script from flux-core, and then use it something like this in valgrind-workload.sh before sched module unload:

# Wait up to 5s for last job to be fully complete before removing sched module:
KEY=$(echo $(flux wreck last-jobid -p).state)
${SHARNESS_TEST_SRCDIR}/scripts/kvs-watch-until.lua -t 5 $KEY 'v == "complete"'

If you want I could push a change directly onto this PR for evaluation purposes

@grondo
Copy link
Contributor

grondo commented May 11, 2018

It turns out we'll also need to update sharness.sh itself for the above to work, because the older sched version doesn't export $SHARNESS_TEST_SRCDIR

If you want I can add this synchronization and sharness update as a separate PR on which you could rebase this one once merged.

@grondo
Copy link
Contributor

grondo commented May 11, 2018

@dongahn, there are 3 commits you can cherry-pick from https://github.com/grondo/flux-sched/tree/valgrind-sync

  • f5b1776 testsuite: ensure last job has completed in valgrind test
  • 3194926 testsuite: add kvs-watch-until.lua
  • e9c8746 testsuite: update sharness.sh to version from flux-core

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Thanks @grondo. Hmmm, at e9c8746, t000-sharness.t fails in my build on quartz. I may be doing something wrong though.

PASS: t0000-sharness.t 1 - success is reported like this
XFAIL: t0000-sharness.t 2 - pretend we have a known breakage # TODO known breakage
FAIL: t0000-sharness.t 3 - pretend we have a fully passing test suite
FAIL: t0000-sharness.t 4 - pretend we have a partially passing test suite
FAIL: t0000-sharness.t 5 - pretend we have a known breakage
FAIL: t0000-sharness.t 6 - pretend we have fixed a known breakage
FAIL: t0000-sharness.t 7 - pretend we have fixed one of two known breakages (run in sub sharness)
FAIL: t0000-sharness.t 8 - pretend we have a pass, fail, and known breakage
FAIL: t0000-sharness.t 9 - pretend we have a mix of all possible results
PASS: t0000-sharness.t 10 - test runs if prerequisite is satisfied
SKIP: t0000-sharness.t 11 # SKIP unmet prerequisite causes test to be skipped (missing DONTHAVEIT)
PASS: t0000-sharness.t 12 - test runs if prerequisites are satisfied
SKIP: t0000-sharness.t 13 # SKIP unmet prerequisites causes test to be skipped (missing DONTHAVEIT of HAVEIT,DONTHAVEIT)
SKIP: t0000-sharness.t 14 # SKIP unmet prerequisites causes test to be skipped (missing DONTHAVEIT of DONTHAVEIT,HAVEIT)
PASS: t0000-sharness.t 15 - tests clean up after themselves
FAIL: t0000-sharness.t 16 - tests clean up even on failures
FAIL: t0000-sharness.t 17 - cleanup functions tun at the end of the test
ERROR: t0000-sharness.t - exited with status 1

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Hmmm, at e9c8746, t000-sharness.t fails in my build on quartz. I may be doing something wrong though.

Let me try a fresh checkout and see if I can reproduce

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Oh, that test should be updated anyway. I pushed a new commit that updates t0000-sharness.t can you try applying that as well?

If that doesn't work, can you post the results of running t0000-sharness.t -d -v?

Thanks

@grondo
Copy link
Contributor

grondo commented May 11, 2018

@dongahn, fresh checkout of my branch above works for me on quartz. Post the results of running the failing test in debug mode and we can see if that gives any clues.

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Ok. Thanks. I will try this again. But it will have to be this afternoon though.

@dongahn dongahn force-pushed the adjust_completing branch from f7aa24e to 6ec139b Compare May 11, 2018 21:09
@codecov-io
Copy link

codecov-io commented May 11, 2018

Codecov Report

Merging #341 into master will increase coverage by 0.02%.
The diff coverage is 64.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   73.31%   73.34%   +0.02%     
==========================================
  Files          59       59              
  Lines        9853     9840      -13     
==========================================
- Hits         7224     7217       -7     
+ Misses       2629     2623       -6
Impacted Files Coverage Δ
simulator/simsrv.c 77.66% <ø> (ø) ⬆️
simulator/submitsrv.c 79% <ø> (ø) ⬆️
simulator/simulator.c 90.87% <100%> (+0.07%) ⬆️
simulator/sim_execsrv.c 85.14% <50%> (-0.64%) ⬇️
sched/sched.c 73.68% <65%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6aa70d5...9c68583. Read the comment docs.

@dongahn dongahn force-pushed the adjust_completing branch from 6ec139b to 9035446 Compare May 11, 2018 21:11
@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

OK. I cherry-picked those suggested commits and make check runs well on Quartz.

Your changes look good. If you want, I can squash

53250e4 and 53250e4

to minimize test failures of the 53250e4 commit. But I am also okay as is.

I also kept 8e29304 as a separate commit but this can be squashed to 23a611a as well if you want.

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Yes, I think the separate commits that update sharness.sh and t0000-sharness.t should be squashed. Thanks!

grondo and others added 6 commits May 11, 2018 14:19
Update the flux-sched sharness.sh and t0000-sharness.t scripts
to the lastest version from flux-core.
Add the kvs-watch-until.lua helper script from flux-core to
t/scripts in flux-sched.
Synchronize to ensure last flux-wreckrun job has reached the
completed state in the t5000-valgrind.t test to avoid racing
with the sched module releasing memory for complete jobs.
@dongahn dongahn force-pushed the adjust_completing branch from 9035446 to 9c68583 Compare May 11, 2018 21:21
@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

OK. Squashed.

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Thanks!

@grondo
Copy link
Contributor

grondo commented May 11, 2018

Ok to merge?

@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Yes, looks good to me.

@grondo grondo merged commit 69767b4 into flux-framework:master May 11, 2018
@dongahn
Copy link
Member Author

dongahn commented May 11, 2018

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants