Replies: 16 comments
-
Does
I'm not sure it is Is there a corefile from which you can determine the actual program that segfaulted?
-
It might also be helpful to add
-
I'm having trouble getting the pmi server embedded in the shell to emit debug output. It looks like it should be enabled if we use
-
I thought those messages were logged at "trace" level, so try
-
Oh indeed, that worked. Sorry for the noise.
-
I actually got to the bottom of this problem last night but didn't have the time to update the ticket. You will be surprised how convoluted this is! I will update the ticket. In the meantime, please don't spend more time on this.
-
I found What's really odd is who is crashing is I believe the environment is set up in such a way that
I am guessing this might be coming from LSF as googling the
-
Simplest reproducer:
bash-4.2$ LD_PRELOAD=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/libpami_cudahook.so /lib64/libc.so.6
Segmentation fault (core dumped)
-
I will report this to Roy M and see if either the system admins or IBM can fix it. I am not sure if we can easily work around this by unsetting
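For anyone who wants to experiment with the "unset it for the child" idea, here is a minimal sketch. It assumes the variable to clear is `LD_PRELOAD`, as in the reproducer above, and `run_without_preload` is a hypothetical helper name, not something from Flux or Spectrum MPI. It only shows the fork/unsetenv/exec pattern; whether the launched program still works without the preloaded hook is a separate question.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with LD_PRELOAD removed from the
 * child's environment (assumption: LD_PRELOAD is the variable at fault).
 * Returns the child's exit status, 128 + signal number if the child was
 * killed by a signal (139 for SIGSEGV on Linux), or -1 on fork/waitpid
 * failure.
 */
int run_without_preload(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        unsetenv("LD_PRELOAD");   /* affects only the child process */
        execvp(argv[0], argv);
        _exit(127);               /* reached only if exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

Calling it with an argv of `{"/lib64/libc.so.6", NULL}` mirrors the reproducer, minus the preload.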
-
Two more problems need to be addressed to make Flux work naturally on the new software level on rzansel. It is important to get them fixed because this software stack will be deployed next week, Aug 5!

jsrun flux start hangs with the pmi-shim. Stephen came up with a patch while working on Summit.

diff --git a/src/pmi1.c b/src/pmi1.c
index 2295ad0..e9a3bbf 100644
--- a/src/pmi1.c
+++ b/src/pmi1.c
@@ -60,6 +60,8 @@ PMIX_EXPORT int PMI_Init(int *spawned)
pmix_info_t info[1];
bool val_optinal = 1;
+ //fprintf(stderr, "Initializing PMIx via the pmi-shim\n");
+
if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
/* if we didn't see a PMIx server (e.g., missing envar),
* then allow us to run as a singleton */
@@ -175,6 +177,8 @@ PMIX_EXPORT int PMI_KVS_Put(const char kvsname[], const char key[], const char v
return PMI_SUCCESS;
}
+ //fprintf(stderr, "Putting %s into %s\n", value, key);
+
pmix_output_verbose(2, pmix_globals.debug_output,
"PMI_KVS_Put: KVS=%s, key=%s value=%s", kvsname, key, value);
@@ -257,6 +261,9 @@ PMIX_EXPORT int PMI_KVS_Get( const char kvsname[], const char key[], char value[
if(sscanf(key, "cmbd.%u.uri", &scanned_rank) > 0) {
proc.rank = scanned_rank;
//fprintf(stderr, "Using rank %u for get of %s\n", scanned_rank, key);
+ } else if(strncmp(key, "flux.instance-level", 21) == 0) {
+ //fprintf(stderr, "Error'ing out for get of %s\n", key);
+ return -1;
} else {
proc.rank = PMIX_RANK_UNDEF;
//fprintf(stderr, "Using rank undefined for get of %s\n", key);
@@ -268,6 +275,7 @@ PMIX_EXPORT int PMI_KVS_Get( const char kvsname[], const char key[], char value[
rc = PMIX_ERROR;
} else if (NULL != val->data.string) {
pmix_strncpy(value, val->data.string, length-1);
+ //fprintf(stderr, "Successfully got %s for %s\n", value, key);
}
PMIX_VALUE_RELEASE(val);
}

I will clean up this patch and create a new pmi-shim package, then test whether the new pmi-shim works on both Lassen and rzansel.

flux mini run segfaults on an application built with the default MPI on rzansel.
This is again the same issue Stephen encountered on Summit. I will report this to Roy. Since both Veronica from ORNL and we are reporting the same issue to IBM, IBM may find us a workaround sooner. In the meantime, I confirmed that our side-installed MPI (currently used on Lassen) still works on rzansel with Flux, so our users should keep using that version until this problem is fixed.
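A side note on the key check in the patch, in case it helps with the cleanup: `"flux.instance-level"` is 19 characters long, so `strncmp(key, "flux.instance-level", 21)` also compares the terminating NUL and behaves like an exact match rather than a prefix match. The sketch below spells out the two options. The function names are made up for illustration and are not part of the pmi-shim.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Key that the patch special-cases in PMI_KVS_Get. */
#define INSTANCE_LEVEL_KEY "flux.instance-level"

/* Exact match: true only for "flux.instance-level" itself.
 * Equivalent to the patch's strncmp with a limit of 21. */
bool is_instance_level_key(const char *key)
{
    assert(key != NULL);
    return strcmp(key, INSTANCE_LEVEL_KEY) == 0;
}

/* Prefix match: also true for longer keys that start with the literal,
 * e.g. a hypothetical "flux.instance-level.foo". */
bool has_instance_level_prefix(const char *key)
{
    assert(key != NULL);
    return strncmp(key, INSTANCE_LEVEL_KEY,
                   sizeof(INSTANCE_LEVEL_KEY) - 1) == 0;
}
```

Using `strcmp` or `sizeof` avoids the hand-counted length, so the intent (exact versus prefix) is visible in the code.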
-
A caveat is that you now need to specify the old MPI library path either with
-
Or
Prepending to
It prevents the top
-
Does that make the
-
It is still there. My hope is that IBM's fix for this will make those warnings go away, but I cannot test this because of the new problem: #3095 (comment)
-
Converting this to a discussion as well, since this is a partner issue. Let me mark some workarounds as answers as well.
-
Bummer. You can only select one posting as "answer". So when we use this discussion feature, we may need a summary posting and select it as "answer."
-
This is with the newly tagged versions.
It seems that certain environment variables or similar settings set by jsrun are causing the flux broker to segfault.