flux session startup synchronization point is needed #8
One possible approach: The "live" module currently executes a "live.hello" handshake that is reduced to rank 0 and used to update the kvs conf.live.status nodesets. Ranks are initially in the "unknown" nodeset, then migrate to the "ok" nodeset as they say hello. Rank 0 could kvs_watch conf.live.status and launch the command only after all ranks have checked in. This would be best moved out of the broker thread and into a new comms module. Also: the event and request networks currently wire up asynchronously. We would need to ensure that the event network is live before saying hello, or the shutdown event could still be lost.
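A minimal sketch of what the rank 0 watch could look like. This uses the current flux_kvs_lookup() watch interface as a stand-in for the older kvs_watch() call named above; count_ranks() and launch_initial_program() are hypothetical helpers, and the layout of conf.live.status as a JSON object containing an "ok" nodeset string is an assumption:

```c
#include <flux/core.h>

/* Hypothetical helpers: parse a nodeset string like "[0-63]" and
 * launch the rank 0 initial program. */
extern int count_ranks (const char *nodeset);
extern void launch_initial_program (flux_t *h);

static int session_size = 64;   /* total broker ranks expected */

static void status_cb (flux_future_t *f, void *arg)
{
    flux_t *h = flux_future_get_flux (f);
    const char *ok;

    /* Assumes conf.live.status is a JSON object with an "ok" nodeset. */
    if (flux_kvs_lookup_get_unpack (f, "{s:s}", "ok", &ok) < 0) {
        flux_log_error (h, "conf.live.status watch");
        return;
    }
    if (count_ranks (ok) == session_size) {
        launch_initial_program (h);
        flux_kvs_lookup_cancel (f);   /* all ranks checked in; stop */
    }
    flux_future_reset (f);            /* re-arm for the next update */
}

/* Called on rank 0 during startup, before the initial program runs. */
int watch_live_status (flux_t *h)
{
    flux_future_t *f;

    if (!(f = flux_kvs_lookup (h, NULL, FLUX_KVS_WATCH, "conf.live.status")))
        return -1;
    return flux_future_then (f, -1., status_cb, NULL);
}
```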
Mark merged pull request #24 above, which ensures the main request overlay is connected before the rank 0 command is launched. There is still a potential race with the event overlay as described above, so leave this bug open until that is resolved.
Actually I'm going to close this one and open a new one on the event stuff.
backport: ensure flux(1) doesn't prepend system path to PYTHONPATH
Problem: unloading the resource module with events posted to the eventlog in flight can result in a segfault.

```
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
102     ../sysdeps/x86_64/multiarch/strcmp-avx2.S: No such file or directory.
[Current thread is 1 (Thread 0x7fe74b7fe700 (LWP 3495430))]
(gdb) bt
#0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
#1  0x00007fe764f40de0 in aux_item_find (key=<optimized out>, head=0x7fe73c006180) at aux.c:88
#2  aux_get (head=<optimized out>, key=0x7fe764f5b000 "flux::log") at aux.c:119
#3  0x00007fe764f1f0d4 in getctx (h=h@entry=0x7fe73c00c6d0) at flog.c:72
#4  0x00007fe764f1f3a5 in flux_vlog (h=0x7fe73c00c6d0, level=7, fmt=0x7fe7606318fc "%s: %s event posted", ap=ap@entry=0x7fe74b7fd790) at flog.c:146
#5  0x00007fe764f1f333 in flux_log (h=<optimized out>, lev=lev@entry=7, fmt=fmt@entry=0x7fe7606318fc "%s: %s event posted") at flog.c:195
#6  0x00007fe76061166a in reslog_cb (reslog=<optimized out>, name=0x7fe73c016380 "online", arg=0x7fe73c013000) at acquire.c:319
#7  0x00007fe760610deb in notify_callback (event=<optimized out>, reslog=0x7fe73c005b90) at reslog.c:47
#8  post_handler (reslog=reslog@entry=0x7fe73c005b90, f=0x7fe73c00a510) at reslog.c:91
#9  0x00007fe760611250 in reslog_destroy (reslog=0x7fe73c005b90) at reslog.c:182
#10 0x00007fe76060e6b8 in resource_ctx_destroy (ctx=ctx@entry=0x7fe73c016640) at resource.c:129
#11 0x00007fe76060ef18 in resource_ctx_destroy (ctx=0x7fe73c016640) at resource.c:331
```

It looks like the acquire subsystem got a callback for a rank coming online after its context was freed. Set the reslog callback to NULL before destroying the acquire context. Also, set the monitor callback to NULL before destroying the discover context, as that destructor appears to have the same safety issue.
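A sketch of the teardown ordering described above. The struct layout and the *_set_callback / *_destroy signatures are hypothetical, modeled on the frames in the backtrace rather than on the actual flux-core headers:

```c
/* Hypothetical declarations modeled on the backtrace above. */
struct reslog;
struct acquire;
struct monitor;
struct discover;

struct resource_ctx {
    struct reslog *reslog;
    struct acquire *acquire;
    struct monitor *monitor;
    struct discover *discover;
};

extern void reslog_set_callback (struct reslog *r, void (*cb)(void *), void *arg);
extern void monitor_set_callback (struct monitor *m, void (*cb)(void *), void *arg);
extern void acquire_destroy (struct acquire *a);
extern void discover_destroy (struct discover *d);
extern void reslog_destroy (struct reslog *r);

static void resource_ctx_destroy (struct resource_ctx *ctx)
{
    if (ctx) {
        /* Detach consumers first: an event still in flight in reslog
         * now sees a NULL callback instead of a freed acquire context. */
        reslog_set_callback (ctx->reslog, NULL, NULL);
        acquire_destroy (ctx->acquire);

        /* Same ordering for the monitor/discover pair. */
        monitor_set_callback (ctx->monitor, NULL, NULL);
        discover_destroy (ctx->discover);

        /* Only now flush remaining events and destroy the log. */
        reslog_destroy (ctx->reslog);
    }
}
```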
When a fast-cycling session is launched, for example to run a quick test command such as /bin/true, the rank 0 command executes immediately, potentially before any children have established connections to rank 0. When the command exits, the rank 0 broker sends out a shutdown event. If the shutdown event is sent before the children connect, they never receive it: rank 0 exits after the 2s grace period, and the children remain, still trying to connect. They then have to be killed by flux-start or by srun.
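A sketch of the subscribe-before-hello ordering suggested in the first comment, as seen from a child rank. Whether live.hello is sent via flux_rpc() here, and whether a successful flux_event_subscribe() is a sufficient proxy for the event overlay being wired, are both assumptions:

```c
#include <flux/core.h>

int child_rank_startup (flux_t *h)
{
    flux_future_t *f;

    /* Subscribe to shutdown events first ... */
    if (flux_event_subscribe (h, "shutdown") < 0)
        return -1;

    /* ... then say hello, so rank 0 only counts ranks that can
     * already receive a shutdown event published immediately after
     * the rank 0 command exits. */
    if (!(f = flux_rpc (h, "live.hello", NULL, FLUX_NODEID_UPSTREAM, 0)))
        return -1;
    if (flux_future_get (f, NULL) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    flux_future_destroy (f);
    return 0;
}
```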