python/test: update to new tmpdir scheme #5

trws · 2015-09-01T21:21:12Z

FLUX_TMPDIR has been removed; this commit removes references to it from
sideflux and repairs the behavior of FLUX_URI.

The broker's environment is set up by its enclosing instance. Save the value of FLUX_URI and make it available via flux_getattr ("parent-uri"), then unset FLUX_URI so it is not used accidentally.

Eliminate FLUX_TMPDIR. When launching locally, use ${TMPDIR:-/tmp} for socket paths, and create a subdir structured as before for the local socket and broker pid. Repeated local launches from the initial program of the first do not create sockets in subdirectories of each other; they are flat in ${TMPDIR:-/tmp}. When launching recursively, however, use the parent-uri to create the subdir for the local socket and broker pid in a subdirectory of the parent. (Other sockets are wildcard paths shared via PMI, so this does not apply to them.)

Set/override FLUX_URI for the initial program and cmb.exec, and make it available via flux_getattr ("local-uri").

Rename other attributes to avoid confusion between the TBON hierarchy and the instance hierarchy:

    parent-uri  -> tbon-parent-uri
    request-uri -> tbon-request-uri
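As a rough illustration of the FLUX_URI handoff described above, the following Python sketch models the broker's attributes as a plain dict. The function names consume_parent_uri and child_environment are hypothetical and not part of flux-core; only the FLUX_URI variable and the "parent-uri" and "local-uri" attribute names come from the commit message.

```python
def consume_parent_uri(env):
    """Move FLUX_URI out of the environment into a 'parent-uri' attribute.

    Returns (attrs, env) as new dicts; the caller's mapping is not modified.
    """
    env = dict(env)
    attrs = {}
    parent = env.pop("FLUX_URI", None)  # unset so it cannot be used by accident
    if parent is not None:
        attrs["parent-uri"] = parent
    return attrs, env


def child_environment(env, attrs, local_uri):
    """Record 'local-uri' and export it as FLUX_URI for child programs."""
    attrs["local-uri"] = local_uri
    child = dict(env)
    child["FLUX_URI"] = local_uri  # set/override for the initial program
    return child
```

A top-level instance simply has no FLUX_URI to consume, so in this sketch it ends up with no "parent-uri" attribute at all.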

Create a top-level, unique temporary directory to contain broker sockets.

When an instance is launched directly by flux-start, the instance shares this directory and puts all of its ranks' sockets inside it. This means flux-start needs to create it and share it with each broker. The new --socket-directory option enables this. A simplified directory structure and naming results, e.g.

    /tmp/flux.5wv21M/event
    /tmp/flux.5wv21M/0/broker.pid
    /tmp/flux.5wv21M/0/local
    /tmp/flux.5wv21M/0/req
    /tmp/flux.5wv21M/1/broker.pid
    /tmp/flux.5wv21M/1/local
    /tmp/flux.5wv21M/1/req
    ...

When an instance is being launched via slurm or flux, each rank creates its own unique temporary directory. In the cases where ranks need to find each other's ipc sockets (such as when an event relay is active), these URIs are shared via PMI, so it is not necessary for them to be computed relative to a known directory.
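The shared layout above can be sketched in a few lines of Python. This is only a model of the directory structure, not the broker's code: create_session_dir and rank_paths are hypothetical names, and the "flux." prefix is taken from the example paths in the commit message.

```python
import os
import tempfile


def create_session_dir(size, base=None):
    """Create a unique top-level directory with one subdirectory per rank."""
    top = tempfile.mkdtemp(prefix="flux.", dir=base)
    for rank in range(size):
        os.mkdir(os.path.join(top, str(rank)))
    return top


def rank_paths(top, rank):
    """Return the paths one rank would use inside the session directory."""
    rankdir = os.path.join(top, str(rank))
    return {
        "event": os.path.join(top, "event"),  # shared by all ranks
        "pidfile": os.path.join(rankdir, "broker.pid"),
        "local": os.path.join(rankdir, "local"),
        "req": os.path.join(rankdir, "req"),
    }
```

Because every path is derived from the single top-level directory, removing that one directory cleans up the whole session.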

When launching an instance directly, create a temporary directory, register it with the cleanup handler, then launch each broker with --socket-directory pointing to it. All the sockets and pidfiles for the session will be self-contained in this directory.
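A minimal sketch of that launch sequence, under stated assumptions: the broker executable name "flux-broker" is a placeholder, atexit stands in for the cleanup handler, and the real launcher passes further options that are omitted here. Only the --socket-directory option itself comes from the commit message; the "=" form of passing its value is assumed.

```python
import atexit
import shutil
import tempfile


def plan_direct_launch(size, broker="flux-broker"):
    """Create the session directory and build one command line per broker."""
    sockdir = tempfile.mkdtemp(prefix="flux-")
    # Remove the whole session directory when this process exits.
    atexit.register(shutil.rmtree, sockdir, ignore_errors=True)
    commands = [[broker, f"--socket-directory={sockdir}"] for _ in range(size)]
    return sockdir, commands
```

The function only plans the launch; handing each command line to a process spawner is left out so the sketch stays runnable without a Flux installation.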

Now that the broker is creating its rank-specific directory, and therefore its pidfile, in a unique directory, simply fail out if this directory already exists and skip both the pid liveness check and the --force option. This failure mode should be practically impossible now.
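The simplified behavior amounts to letting directory creation itself be the collision check. A hedged Python sketch (create_rank_dir is a hypothetical name; the broker.pid filename is taken from the directory listing in an earlier commit message):

```python
import os


def create_rank_dir(sockdir, rank):
    """Create the rank directory and pidfile, failing if the directory exists."""
    rankdir = os.path.join(sockdir, str(rank))
    os.mkdir(rankdir)  # raises FileExistsError; no liveness check, no --force
    pidfile = os.path.join(rankdir, "broker.pid")
    with open(pidfile, "w") as f:
        f.write(f"{os.getpid()}\n")
    return pidfile
```

Since mkdir is atomic, two brokers racing for the same rank directory cannot both succeed, which is what makes the separate pid liveness check unnecessary.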

Problem: unloading the resource module while events posted to the eventlog are in flight can result in a segfault.

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
    102     ../sysdeps/x86_64/multiarch/strcmp-avx2.S: No such file or directory.
    [Current thread is 1 (Thread 0x7fe74b7fe700 (LWP 3495430))]
    (gdb) bt
    #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
    #1  0x00007fe764f40de0 in aux_item_find (key=<optimized out>, head=0x7fe73c006180) at aux.c:88
    #2  aux_get (head=<optimized out>, key=0x7fe764f5b000 "flux::log") at aux.c:119
    #3  0x00007fe764f1f0d4 in getctx (h=h@entry=0x7fe73c00c6d0) at flog.c:72
    #4  0x00007fe764f1f3a5 in flux_vlog (h=0x7fe73c00c6d0, level=7, fmt=0x7fe7606318fc "%s: %s event posted", ap=ap@entry=0x7fe74b7fd790) at flog.c:146
    #5  0x00007fe764f1f333 in flux_log (h=<optimized out>, lev=lev@entry=7, fmt=fmt@entry=0x7fe7606318fc "%s: %s event posted") at flog.c:195
    #6  0x00007fe76061166a in reslog_cb (reslog=<optimized out>, name=0x7fe73c016380 "online", arg=0x7fe73c013000) at acquire.c:319
    #7  0x00007fe760610deb in notify_callback (event=<optimized out>, reslog=0x7fe73c005b90) at reslog.c:47
    #8  post_handler (reslog=reslog@entry=0x7fe73c005b90, f=0x7fe73c00a510) at reslog.c:91
    #9  0x00007fe760611250 in reslog_destroy (reslog=0x7fe73c005b90) at reslog.c:182
    #10 0x00007fe76060e6b8 in resource_ctx_destroy (ctx=ctx@entry=0x7fe73c016640) at resource.c:129
    #11 0x00007fe76060ef18 in resource_ctx_destroy (ctx=0x7fe73c016640) at resource.c:331

It looks like the acquire subsystem got a callback for a rank coming online after its context was freed.

Set the reslog callback to NULL before destroying the acquire context. Also, set the monitor callback to NULL before destroying the discover context, as it appears this destructor has a similar safety issue.

garlick and others added 19 commits August 31, 2015 22:09

connectors/local: require path to be set

4bb8c0b

Stop accepting a bare "local://" URI and deriving the default path from $FLUX_TMPDIR; require the path to be given explicitly. Rename the socket from flux-api to flux-local (cleanup).
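The stricter URI handling can be illustrated with a small Python sketch. The function name local_socket_path is hypothetical, and the real connector is written in C; the sketch only shows the rule that a "local://" URI with no path is now an error rather than a request for a default.

```python
PREFIX = "local://"


def local_socket_path(uri):
    """Return the socket path from a local:// URI; the path is mandatory."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a local URI: {uri!r}")
    path = uri[len(PREFIX):]
    if not path:
        # Previously this case fell back to a path under $FLUX_TMPDIR.
        raise ValueError("local URI must include a socket path")
    return path
```

For example, local_socket_path("local:///tmp/flux-local") yields the path after the scheme, while local_socket_path("local://") raises ValueError.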

modules/connector-local: use flux_getattr

ceb1e11

Obtain the URI of the local socket by asking the broker with flux_getattr ("local-uri") instead of deriving it from FLUX_TMPDIR. Rename the socket from flux-api to flux-local (cleanup).

cmd/flux-comms: eliminate Usage refs to old attrs

bebc5ab

This just removes some attribute names in usage messages that are no longer correct, and probably shouldn't have been there anyway.

cmd/flux-config: test FLUX_URI not FLUX_TMPDIR

e81922e

To determine whether to get config from the KVS or config file, test the FLUX_URI variable, not the deprecated FLUX_TMPDIR variable.
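In outline, the check reduces to one environment lookup. The sketch below assumes that a set, non-empty FLUX_URI means "running inside an instance, so read config from the KVS"; that direction is an inference from the commit message, and config_source is a hypothetical name.

```python
import os


def config_source(env=None):
    """Pick the config backend based on FLUX_URI (assumed semantics)."""
    env = os.environ if env is None else env
    return "kvs" if env.get("FLUX_URI") else "file"
```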

cmd/flux: replace FLUX_TMPDIR with FLUX_URI

2d321b8

Drop the -T,--tmpdir argument. Test the FLUX_URI variable, not FLUX_TMPDIR, to determine whether to load config from the KVS or file.

modules/wreck: set FLUX_URI

d92d83f

The wreck module sets FLUX_URI in the environment of wrexecd, and wrexecd overrides it in the environment of the program being launched. Obtain the value to use in the wreck module by calling flux_getattr ("local-uri").

test/exec: make FLUX_TMPDIR test use FLUX_URI

a4654b5

Modify the "flux exec does not pass $FLUX_TMPDIR" test to use $FLUX_URI instead.

test/cmddriver: drop several irrelevant tests

b074157

With FLUX_TMPDIR gone and "local://" no longer a valid URI for the local connector, several tests were just no longer relevant.

modules/live: obtain tbon-parent-uri by new name

2b61351

The parent-uri attribute was renamed to tbon-parent-uri for disambiguation. Update this user.

bindings/lua: wreck.lua should use FLUX_URI

583fae8

get_filtered_environment() should filter FLUX_URI, not FLUX_TMPDIR.
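The binding itself is Lua, but the idea is easy to show in Python: build the job environment from the caller's environment with instance-specific variables removed. The function below is an illustrative stand-in, not a port of wreck.lua, and the blocked tuple contains only the variable named in the commit message.

```python
def filtered_environment(env, blocked=("FLUX_URI",)):
    """Return a copy of env without variables that must not leak into jobs."""
    return {k: v for k, v in env.items() if k not in blocked}
```

Filtering matters here because FLUX_URI is set per instance; passing the submitter's value through would point the launched program at the wrong broker socket.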

connectors/local: rename flux-local to local

81829f2

With the socket directory reorganization, sockets get shorter names, so rename the 'flux-local' socket to just 'local'.

cmd/flux-start: include sid in socket dir

f8a3996

Create a directory in tmp that looks like this: flux-<sid>-XXXXXX, where XXXXXX is a random component.

broker: include sid in socket dir

674e89e

If the broker is creating the socket dir, it should look like this: flux-<sid>-XXXXXX where XXXXXX is a random component. The random component is necessary to avoid name collisions when instances from overlapping sid spaces are launched.
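A short Python sketch of the naming scheme, assuming the session id is available as a string. make_instance_dir is a hypothetical helper. Python's tempfile.mkdtemp appends eight random characters rather than the six implied by the XXXXXX template, so the names it produces only approximate the real ones.

```python
import tempfile


def make_instance_dir(sid, tmpdir=None):
    """Create a unique directory named flux-<sid>-<random> and return its path."""
    return tempfile.mkdtemp(prefix=f"flux-{sid}-", dir=tmpdir)
```

Two launches that happen to share a sid still get distinct directories, because mkdtemp retries with a new random suffix until creation succeeds.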

cmd/flux-list-instances: grok new directory names

8ec80b9

Match the new instance directory names: flux-<sid>-XXXXXX. Eliminate the --all and --top-only options, as instance directories are no longer hierarchical.
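Recognizing the flat directory names comes down to one pattern match per entry. The sketch below assumes the sid is a non-empty string and the random component is exactly six alphanumeric characters, per the XXXXXX template; find_instances is a hypothetical name, and the real command is not written in Python.

```python
import re

# flux-<sid>-XXXXXX, with XXXXXX assumed to be six alphanumeric characters.
INSTANCE_DIR = re.compile(r"^flux-(?P<sid>.+)-(?P<rand>[A-Za-z0-9]{6})$")


def find_instances(names):
    """Return (name, sid) pairs for entries that look like instance dirs."""
    found = []
    for name in names:
        m = INSTANCE_DIR.match(name)
        if m:
            found.append((name, m.group("sid")))
    return found
```

Because instance directories are no longer nested, a single non-recursive scan of the temporary directory is enough, which is why options for walking the hierarchy have nothing left to do.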

python/test: update to new tmpdir scheme

bca64da

FLUX_TMPDIR has been removed; this commit removes references to it from sideflux and repairs the behavior of FLUX_URI.

garlick force-pushed the tmpdir branch from e8bab97 to 1de5bfe Compare September 1, 2015 22:17

garlick closed this Sep 9, 2015
