python/test: update to new tmpdir scheme #5

Closed

@trws commented Sep 1, 2015

FLUX_TMPDIR has been removed; this commit removes references to it from
sideflux and repairs behavior of FLUX_URI.

garlick and others added 19 commits August 31, 2015 22:09
Don't accept a bare "local://" URI and fall back to the default path
from $FLUX_TMPDIR.  Require the path.

Rename the socket from flux-api to flux-local (cleanup).
Obtain the URI of the local socket by asking the broker with
flux_getattr ("local-uri") instead of deriving it from FLUX_TMPDIR.

Rename the socket from flux-api to flux-local (cleanup).
The broker's environment is set up by its enclosing instance.
Save the value of FLUX_URI, and make it available via
flux_getattr ("parent-uri").  Then unset FLUX_URI so this is not
used accidentally.
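The save-then-unset step could be sketched roughly as follows. This is Python rather than the broker's C, and `capture_parent_uri`, the `attrs` dict, and the example URI are hypothetical stand-ins for the broker's attribute table, not names taken from the commit.

```python
def capture_parent_uri(env, attrs):
    """Move FLUX_URI out of env and record it as the 'parent-uri' attribute.

    env and attrs are plain dicts here; both are illustrative stand-ins.
    """
    # pop() removes the variable, so later code cannot use it by accident.
    parent_uri = env.pop("FLUX_URI", None)
    if parent_uri is not None:
        attrs["parent-uri"] = parent_uri
    return parent_uri


# The URI below is made up for the example.
env = {"FLUX_URI": "local:///tmp/flux-example/0", "PATH": "/usr/bin"}
attrs = {}
capture_parent_uri(env, attrs)
print(attrs, "FLUX_URI" in env)
```

Removing the variable in the same step that records it leaves no window where both the attribute and the inherited environment value are live.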

Eliminate FLUX_TMPDIR.  When launching locally, use ${TMPDIR:-/tmp}
for socket paths, and create a subdir structured as before for the
local socket and broker pid.  Repeated local launches from the initial
program of the first do not create sockets in subdirectories of
each other.  They will be flat in ${TMPDIR:-/tmp}.

However when launching recursively, use the parent-uri to create
the subdir for the local socket and broker pid in a subdirectory
of the parent.  (Other sockets are wildcard paths shared via PMI
so don't apply here).
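A minimal sketch of that choice, under stated assumptions: the function name `socket_base_dir` is hypothetical, and treating the directory that holds the parent's local socket as "the parent" is a guess at what the commit means, not something the message confirms.

```python
import os


def socket_base_dir(env, parent_uri=None):
    """Pick the directory under which a new instance subdir would be created."""
    prefix = "local://"
    if parent_uri is not None and parent_uri.startswith(prefix):
        # Recursive launch: nest under the directory holding the parent's
        # local socket (assumed interpretation).
        return os.path.dirname(parent_uri[len(prefix):])
    # Local launch: same meaning as the shell's ${TMPDIR:-/tmp},
    # i.e. fall back when TMPDIR is unset or empty.
    return env.get("TMPDIR") or "/tmp"


print(socket_base_dir({"TMPDIR": "/var/tmp"}))
print(socket_base_dir({}, "local:///tmp/flux-a/0/flux-local"))
```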

Set/override FLUX_URI for the initial program and cmb.exec.
Make it available via flux_getattr ("local-uri").

Rename other attributes to avoid confusion between TBON hierarchy
and instance hierarchy:
  parent-uri -> tbon-parent-uri
  request-uri -> tbon-request-uri

This just removes some attribute names in usage messages that
are no longer correct, and probably shouldn't have been there
anyway.

To determine whether to get config from the KVS or config file,
test the FLUX_URI variable, not the deprecated FLUX_TMPDIR variable.
Drop the -T,--tmpdir argument.

Test the FLUX_URI variable, not FLUX_TMPDIR, to determine whether
to load config from the KVS or file.

The wreck module sets FLUX_URI in the environment of wrexecd,
and wrexecd overrides it in the environment of the program
being launched.

Obtain the value to use in the wreck module by calling
flux_getattr ("local-uri").

Modify the "flux exec does not pass $FLUX_TMPDIR" test to
use $FLUX_URI instead.

With FLUX_TMPDIR gone and "local://" no longer a valid URI
for the local connector, several tests were just no longer
relevant.

The parent-uri attribute was renamed to tbon-parent-uri for
disambiguation.  Update this user.

get_filtered_envronment() should filter FLUX_URI, not FLUX_TMPDIR.
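For illustration only, an environment filter of this kind might look like the sketch below. `filtered_environment` and its `drop` parameter are hypothetical; this is not the function from the test suite.

```python
import os


def filtered_environment(env=None, drop=("FLUX_URI",)):
    """Return a copy of env without the variables named in drop."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in drop}


clean = filtered_environment({"FLUX_URI": "local:///tmp/x", "HOME": "/home/u"})
print(clean)
```

Returning a new dict keeps the caller's environment untouched, which matters when the same process later needs FLUX_URI to reach its own instance.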
Create a top level, unique temporary directory to contain broker
sockets.

When an instance is launched directly by flux-start, the instance
shares this directory and puts all of its ranks' sockets inside it.
This means flux-start needs to create it and share it with each
broker.  The new --socket-directory option enables this.

A simplified directory structure and naming results, e.g.

  /tmp/flux.5wv21M/event
  /tmp/flux.5wv21M/0/broker.pid
  /tmp/flux.5wv21M/0/local
  /tmp/flux.5wv21M/0/req
  /tmp/flux.5wv21M/1/broker.pid
  /tmp/flux.5wv21M/1/local
  /tmp/flux.5wv21M/1/req
  ...
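The layout above can be reproduced with a short sketch. `instance_paths` is a hypothetical helper written for this example; only the file names (`event`, `broker.pid`, `local`, `req`) come from the listing.

```python
import os


def instance_paths(sockdir, size):
    """List the expected paths under sockdir for an instance of `size` ranks."""
    # One event socket is shared by the whole instance.
    paths = [os.path.join(sockdir, "event")]
    for rank in range(size):
        # Each rank gets its own subdirectory named after the rank number.
        rankdir = os.path.join(sockdir, str(rank))
        for name in ("broker.pid", "local", "req"):
            paths.append(os.path.join(rankdir, name))
    return paths


for path in instance_paths("/tmp/flux.5wv21M", 2):
    print(path)
```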

When an instance is being launched via slurm or flux, each
rank creates its own unique temporary directory.  In the cases
where ranks need to find each other's ipc sockets (such as
when an event relay is active), these URIs are shared via PMI
so it is not necessary for them to be computed relative to a known
directory.

When launching an instance directly, create a temporary directory,
register it with the cleanup handler, then launch each broker with
--socket-directory pointing to it.

All the sockets and pidfiles for the session will be self-contained
in this directory.
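The create, register, and launch sequence might be sketched as follows. The `flux-broker` command name and the `--socket-directory=DIR` spelling are assumptions (the commit only names the option), and the sketch builds argument lists without executing anything or saying how each broker learns its rank.

```python
import atexit
import os
import shutil
import tempfile


def prepare_launch(size, broker_cmd=("flux-broker",)):
    """Create a shared socket directory and build one argv per broker.

    broker_cmd is a hypothetical command name; nothing is executed here.
    """
    sockdir = tempfile.mkdtemp(prefix="flux.")
    # Remove the directory, sockets and pidfiles included, at interpreter exit.
    atexit.register(shutil.rmtree, sockdir, ignore_errors=True)
    option = "--socket-directory=" + sockdir
    argvs = [list(broker_cmd) + [option] for _ in range(size)]
    return sockdir, argvs


sockdir, argvs = prepare_launch(2)
print(os.path.isdir(sockdir), argvs[0][-1])
```

Registering the cleanup immediately after creation means the directory is removed even if a later step of the launch fails.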
Now that the broker is creating its rank-specific directory
and therefore its pidfile in a unique directory, simply fail out
if this directory already exists and drop both the pid
liveness check and the --force option.  This failure mode should
be practically impossible now.

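A sketch of the fail-if-exists behavior, assuming the rank directory is created with a plain `mkdir`. `create_rank_dir` is a hypothetical name, and the broker's actual error handling is not shown in the commit message.

```python
import os
import tempfile


def create_rank_dir(sockdir, rank):
    """Create the per-rank directory, failing if it already exists."""
    rankdir = os.path.join(sockdir, str(rank))
    # os.mkdir raises FileExistsError when the path is already present,
    # so no pidfile liveness check or force flag is needed.
    os.mkdir(rankdir, 0o700)
    return rankdir


sockdir = tempfile.mkdtemp(prefix="flux.")
first = create_rank_dir(sockdir, 0)
try:
    create_rank_dir(sockdir, 0)
    collided = False
except FileExistsError:
    collided = True
print(collided)
```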
With the socket directory reorganization, sockets get
shorter names, so rename the 'flux-local' socket to just 'local'.

Create a directory in tmp that looks like this
  flux-<sid>-XXXXXX
where XXXXXX is a random component.

If the broker is creating the socket dir, it should look like this:
    flux-<sid>-XXXXXX
where XXXXXX is a random component.

The random component is necessary to avoid name collisions when instances
from overlapping sid spaces are launched.
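As an illustration of why the random suffix helps, the sketch below creates two directories for the same sid and gets distinct names. `make_instance_dir` is a hypothetical helper. Note also that Python's `tempfile.mkdtemp` appends eight random characters rather than the six implied by the `XXXXXX` template, so the names differ in length from the ones described above.

```python
import os
import tempfile


def make_instance_dir(sid, base=None):
    """Create a uniquely named directory of the form flux-<sid>-<random>."""
    # mkdtemp creates the directory atomically and retries on name collisions.
    return tempfile.mkdtemp(prefix="flux-{}-".format(sid), dir=base)


a = make_instance_dir(0)
b = make_instance_dir(0)
print(os.path.basename(a), os.path.basename(b))
```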
Match the new instance directory names: flux-sid-XXXXXX.

Eliminate the --all and --top-only options as instance
directories are no longer hierarchical.

FLUX_TMPDIR has been removed; this commit removes references to it from
sideflux and repairs behavior of FLUX_URI.

@garlick closed this Sep 9, 2015
garlick added a commit that referenced this pull request Sep 15, 2020
Problem: unloading the resource module while events posted to the
eventlog are in flight can result in a segfault.

Program terminated with signal SIGSEGV, Segmentation fault.

 #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
 102     ../sysdeps/x86_64/multiarch/strcmp-avx2.S: No such file or directory.
 [Current thread is 1 (Thread 0x7fe74b7fe700 (LWP 3495430))]
 (gdb) bt
 #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
 #1  0x00007fe764f40de0 in aux_item_find (key=<optimized out>,
     head=0x7fe73c006180) at aux.c:88
 #2  aux_get (head=<optimized out>, key=0x7fe764f5b000 "flux::log") at aux.c:119
 #3  0x00007fe764f1f0d4 in getctx (h=h@entry=0x7fe73c00c6d0) at flog.c:72
 #4  0x00007fe764f1f3a5 in flux_vlog (h=0x7fe73c00c6d0, level=7,
     fmt=0x7fe7606318fc "%s: %s event posted", ap=ap@entry=0x7fe74b7fd790)
     at flog.c:146
 #5  0x00007fe764f1f333 in flux_log (h=<optimized out>, lev=lev@entry=7,
    fmt=fmt@entry=0x7fe7606318fc "%s: %s event posted") at flog.c:195
 #6  0x00007fe76061166a in reslog_cb (reslog=<optimized out>,
     name=0x7fe73c016380 "online", arg=0x7fe73c013000) at acquire.c:319
 #7  0x00007fe760610deb in notify_callback (event=<optimized out>,
     reslog=0x7fe73c005b90) at reslog.c:47
 #8  post_handler (reslog=reslog@entry=0x7fe73c005b90, f=0x7fe73c00a510)
     at reslog.c:91
 #9  0x00007fe760611250 in reslog_destroy (reslog=0x7fe73c005b90)
     at reslog.c:182
 #10 0x00007fe76060e6b8 in resource_ctx_destroy (ctx=ctx@entry=0x7fe73c016640)
     at resource.c:129
 #11 0x00007fe76060ef18 in resource_ctx_destroy (ctx=0x7fe73c016640)
     at resource.c:331

It looks like the acquire subsystem got a callback for a rank coming online
after its context was freed.  Set the reslog callback to NULL before
destroying the acquire context.

Also, set the monitor callback to NULL before destroying the discover
context, as it appears this destructor has a similar safety issue.