job: emit gpu request information #1465

Closed

dongahn
Member

@dongahn dongahn commented Apr 13, 2018

Very slightly modified version of @TWRS' change to support gpu scheduling within flux-sched.

Unfortunately, make check on quartz fails with:

make[1]: Entering directory `/g/g0/dahn/workspace/flux-cancel/flux-core/t'
make  module/parent.la module/child.la request/req.la shmem/backtoback.t loop/handle.t loop/dispatch.t loop/reactor.t loop/reduce.t loop/log.t loop/logstderr rpc/rpc.t rpc/mrpc.t rolemask/loop.t kz/kzutil kvs/torture kvs/dtree kvs/blobref kvs/hashtest kvs/watch kvs/watch_disconnect kvs/commit kvs/transactionmerge kvs/fence_namespace_remove kvs/fence_invalid module/basic request/treq barrier/tbarrier wreck/rcalc  \
  t0000-sharness.t t0001-basic.t t0002-request.t t0003-module.t t0004-event.t t0005-exec.t t0007-ping.t t0008-attr.t t0009-dmesg.t t0010-generic-utils.t t0011-content-cache.t t0012-content-sqlite.t t0013-config-file.t t0014-runlevel.t t0015-cron.t t0016-cron-faketime.t t0017-security.t t1000-kvs.t t1001-kvs-internals.t t1002-kvs-watch.t t1003-kvs-stress.t t1004-kvs-namespace.t t1005-kvs-security.t t1101-barrier-basic.t t1102-cmddriver.t t1103-apidisconnect.t t1104-kz.t t1105-proxy.t t2000-wreck.t t2001-jsc.t t2002-pmi.t t2003-recurse.t t2004-hydra.t t2005-hwloc-basic.t t2006-joblog.t t2007-caliper.t t2008-althash.t t2100-aggregate.t t3000-mpi-basic.t t4000-issues-test-driver.t t5000-valgrind.t issues/t0441-kvs-put-get.sh issues/t0505-msg-handler-reg.lua issues/t0821-kvs-segfault.sh lua/t0001-send-recv.t lua/t0002-rpc.t lua/t0003-events.t lua/t0004-getattr.t lua/t0007-alarm.t lua/t0009-sequences.t lua/t1000-reactor.t lua/t1001-timeouts.t lua/t1002-kvs.t lua/t1003-iowatcher.t lua/t1004-statwatcher.t lua/t1005-fdwatcher.t ../t/t9990-python-tests.t scripts/event-trace.lua scripts/event-trace-bypass.lua scripts/kvs-watch-until.lua scripts/kvs-get-ex.lua scripts/cpus-allowed.lua scripts/waitfile.lua scripts/t0004-event-helper.sh scripts/tssh valgrind/valgrind-workload.sh kvs/kvs-helper.sh hwloc-data/1N/shared/02-brokers/0.xml hwloc-data/1N/shared/02-brokers/1.xml hwloc-data/1N/nonoverlapping/02-brokers/0.xml hwloc-data/1N/nonoverlapping/02-brokers/1.xml valgrind/valgrind.supp conf.d/private.conf conf.d/shared.conf conf.d/shared_ipc.conf conf.d/shared_none.conf conf.d/bad-toml.conf conf.d/bad-missing.conf conf.d/bad-rank.conf conf.d/priv2.0.conf conf.d/priv2.1.conf
make[2]: Entering directory `/g/g0/dahn/workspace/flux-cancel/flux-core/t'
  CC       module/module_parent_la-parent.lo
  CC       module/module_child_la-child.lo
  CC       request/request_req_la-req.lo
  CC       shmem/shmem_backtoback_t-backtoback.o
  CC       loop/loop_handle_t-handle.o
  CC       loop/loop_dispatch_t-dispatch.o
  CC       loop/loop_reactor_t-reactor.o
  CC       loop/loop_reduce_t-reduce.o
  CC       loop/loop_log_t-log.o
  CC       loop/loop_logstderr-logstderr.o
  CC       rpc/rpc_rpc_t-rpc.o
  CC       rpc/rpc_rpc_t-util.o
  CC       rpc/rpc_mrpc_t-mrpc.o
  CC       rpc/rpc_mrpc_t-util.o
  CC       rolemask/rolemask_loop_t-loop.o
make[2]: *** No rule to make target `kz/kzutil.c', needed by `kz/kz_kzutil-kzutil.o'.  Stop.
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/g/g0/dahn/workspace/flux-cancel/flux-core/t'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory `/g/g0/dahn/workspace/flux-cancel/flux-core/t'
make: *** [check-recursive] Error 1

My config line was: ./configure --prefix=/g/g0/dahn/workspace/flux-cancel/inst --disable-python

Slightly modified version of @TWRS' change.

Add -g to wreck to allow requesting GPUs.
Propagate that request information to the scheduler
through the job module and jsc.
@grondo
Contributor

grondo commented Apr 14, 2018

@dongahn, you might have a stale build system, since kzutil was replaced with kzcopy. Try rerunning ./autogen.sh and then reconfiguring.
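
A minimal sketch of that rebuild, assuming it is run from the top of a flux-core checkout. The helper name rebuild_and_check and the default prefix are hypothetical; --disable-python mirrors the configure line quoted earlier in this thread.

```shell
# Hypothetical helper: regenerate the autotools build system, reconfigure,
# rebuild, and re-run the test suite. Run from the top of a flux-core checkout.
# $1 is an install prefix; the default below is only a placeholder.
rebuild_and_check() {
    prefix="${1:-$HOME/flux-inst}"
    # Regenerating first refreshes stale Makefile.in files that may still
    # reference sources removed upstream (kzutil.c in this case).
    ./autogen.sh &&
    ./configure --prefix="$prefix" --disable-python &&
    make &&
    make check
}
```

Chaining the steps with && stops at the first failure, so a configure error is not hidden behind later make output.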

@coveralls

Coverage Status

Coverage decreased (-0.06%) to 78.968% when pulling c28eb04 on dongahn:gpu into 0ebaf83 on flux-framework:master.

@dongahn
Member Author

dongahn commented Apr 14, 2018

@grondo: yes that does the trick! But now one of the tests fails on quartz. (Very possible that I'm doing something wrong.)

not ok 64 - wreck: can adjust lwj kvs hiearchy with broker attrs
FAIL: t2000-wreck.t 64 - wreck: can adjust lwj kvs hiearchy with broker attrs
#
#       result=$(flux start -o,-Swreck.lwj-dir-levels=0 flux wreck kvs-path 256) &&
#       test_debug "echo result is $result" &&
#       test "$result" = "lwj.256" &&
#       result=$(flux start -o,-Swreck.lwj-dir-levels=3,-Swreck.lwj-bits-per-dir=6 flux wreck kvs-path 256) &&
#       test_debug "echo result is $result" &&
#       test "$result" = "lwj.0.0.4.256" &&
#       result=$(flux start -o,-Swreck.lwj-dir-levels=0 flux wreckrun echo hello) &&
#       test "$result" = "hello"
#
# failed 1 among 64 test(s)
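
For reference, the path values this test expects can be reproduced with a little arithmetic. The sketch below assumes each directory level is a bits-per-dir wide field of the job ID, most significant level first, followed by the full ID. That assumption agrees with the two expected strings above, but it is an illustration, not necessarily how flux-core computes the path. kvs_path_sketch is a hypothetical name, not a flux command.

```shell
# Illustrative only: build a job's KVS path from its ID.
# ASSUMPTION: level N holds bits [N*bits, (N+1)*bits) of the ID.
# $1 = job ID, $2 = wreck.lwj-dir-levels, $3 = wreck.lwj-bits-per-dir
kvs_path_sketch() {
    id=$1
    levels=$2
    bits=$3
    path="lwj"
    mask=$(( (1 << bits) - 1 ))
    level=$levels
    while [ "$level" -gt 0 ]; do
        # Extract this level's field and append it as a path component.
        path="$path.$(( (id >> (level * bits)) & mask ))"
        level=$(( level - 1 ))
    done
    echo "$path.$id"
}

kvs_path_sketch 256 0 6   # prints lwj.256
kvs_path_sketch 256 3 6   # prints lwj.0.0.4.256
```

With 3 levels of 6 bits, 256 >> 18 and 256 >> 12 are both 0 and 256 >> 6 is 4, which gives the 0.0.4 components.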

@grondo
Contributor

grondo commented Apr 14, 2018

Oh, hm, I haven't seen that test fail before, I'll try to reproduce on quartz.

@grondo
Contributor

grondo commented Apr 14, 2018

Might be a Travis problem: I'm seeing "write error: Resource temporarily unavailable" in the logs for these jobs. I'll restart the builds and see if things go better today.

@grondo
Contributor

grondo commented Apr 14, 2018

When you do have a chance @dongahn, can you run the t2000-wreck.t test with -d -v and see if more information is printed to stderr on that failing test?
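
One way to run that, sketched under the assumption that the test scripts live in the t directory of a built flux-core tree, as the make output above suggests. The helper name run_test_verbose is hypothetical.

```shell
# Hypothetical helper: run one sharness test script with debug and verbose output.
# $1 = path to the flux-core "t" directory, $2 = test script name.
run_test_verbose() {
    tdir=$1
    script=$2
    if [ ! -x "$tdir/$script" ]; then
        echo "cannot find executable $tdir/$script" >&2
        return 2
    fi
    # -d enables test_debug output; -v shows each test's commands and output.
    (cd "$tdir" && "./$script" -d -v)
}
```

For example, run_test_verbose ./t t2000-wreck.t 2>&1 | tee wreck.log would capture both stdout and stderr for pasting into the thread.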

@dongahn
Member Author

dongahn commented Apr 14, 2018

Will do.

@dongahn
Member Author

dongahn commented Apr 14, 2018

not ok 64 - wreck: can adjust lwj kvs hiearchy with broker attrs
#
#	    result=$(flux start -o,-Swreck.lwj-dir-levels=0 flux wreck kvs-path 256) &&
#	    test_debug "echo result is $result" &&
#	    test "$result" = "lwj.256" &&
#	    result=$(flux start -o,-Swreck.lwj-dir-levels=3,-Swreck.lwj-bits-per-dir=6 flux wreck kvs-path 256) &&
#	    test_debug "echo result is $result" &&
#	    test "$result" = "lwj.0.0.4.256" &&
#	    result=$(flux start -o,-Swreck.lwj-dir-levels=0 flux wreckrun echo hello) &&
#	    test "$result" = "hello"
#

    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
    51      1 killed     2018-04-14T17:44:06       0.154s        0 sleep
    53      1 exited     2018-04-14T17:44:07       0.059s        0 hostname
    54      1 killed     2018-04-14T17:44:08       0.151s        0 sleep
    55      1 failed     2018-04-14T17:44:09       0.042s        0 hostname
    56      4 exited     2018-04-14T17:44:09       0.063s    [0-3] true
    57      4 exited     2018-04-14T17:44:09       0.063s    [0-3] true
    58      4 exited     2018-04-14T17:44:09       0.065s    [0-3] true
    59      4 exited     2018-04-14T17:44:10       0.065s    [0-3] true
    60      4 exited     2018-04-14T17:44:10       0.065s    [0-3] true
    61      4 exited     2018-04-14T17:44:10       0.066s    [0-3] true
    62      4 exited     2018-04-14T17:44:10       0.065s    [0-3] true
    63      4 exited     2018-04-14T17:44:11       0.064s    [0-3] true
    64      4 exited     2018-04-14T17:44:11       0.063s    [0-3] true
    65      4 exited     2018-04-14T17:44:11       0.064s    [0-3] true
# failed 1 among 64 test(s)

@dongahn
Member Author

dongahn commented Apr 14, 2018

Given that the same errors occur in Travis, I have to believe this is due to this PR. Probably, there are some more code paths we need to adjust to complete gpu info propagation. I will look at this next week.

@dongahn
Member Author

dongahn commented Apr 19, 2018

This requires a rework per the changes in #1472. Closing this for now.

@dongahn dongahn closed this Apr 19, 2018