debugger support #12
The "proctable" support for wreckrun was removed in 0d547e2. Perhaps we can resurrect this feature, but only enable it as needed?
Thanks @grondo! Just to set the expectation, this work will only target "being able to debug a job once the target flux instance is identified by the user". IOW, the work won't add the features that would allow users to easily debug a job through many nested instances. This work on the v0.11 fork will later inform full support for the new execution system.
Finally, there will be a whole bunch of adjustments on the TotalView side as well.
@grondo: now that I think about this, I may also need

@grondo: Also, what is the easiest way to strip out all the
See examples from:

```c
if ((optindex = optparse_parse_args (opts, argc, argv)) < 0)
    exit (1);
/* optindex is index of the first non-option argument:
 * i.e. new argc, argv = argc-optindex, argv+optindex
 */
```
Yeah, either a FIFO or an open FD would work here. Instead of a new option, a perhaps simpler approach would be a

Yes, I think you'd want to set a watch on the job
Yeah! This will surely be much less disruptive. Thanks.

Agreed.
One more issue. The

```c
typedef struct {
    char *host_name;
    char *executable_name;
    int pid;
} MPIR_PROCDESC;
```

So I will need to fetch the hostname and executable name for each and every process. But this (the hostname in particular) is not part of the kvs schema, correct? How do you suggest I proceed?
The rank id to hostname mapping can be found in the

(This is for v0.11 only.)
BTW, I noticed that the code in commit 0d547e2 to create the per-task "procdesc" was old enough that it used json-c. So unfortunately the old code isn't as useful as I'd hoped.
OK. Let's see if I understand. We need to reconstruct

Presumably

Once that's there,
Take a look at 0d547e2. We removed it because there were no users, but this code added a key to every task called "procdesc" which contains:

```json
{ "command": "executable name",
  "pid": local_pid,
  "nodeid": node_rank
}
```

This is not scalable, but it is easy to start here and improve later.
Oh good. I think we can get a first level of compression by putting the executable names into an array somewhere and referencing it from within each per-rank procdesc (in the same way as the node_rank to hostname lookup).
Ok, my testing also shows this works fine.
@grondo: Just to coordinate, I am pretty far along with this front end. In fact, once
One question though: my

I don't think so, you should be able to specify

Ok, should I go ahead and add
Yes, let's go with that route first and use the lessons learned to optimize for the new execution system. When you have this, could you quickly push it to your existing PR so that I can cherry-pick? I know you may want to add test cases, etc., but having the commit early will help me make progress with my portion as well. Thanks @grondo!
Great. I will test this. Thanks!
Ok, pushed! Note: I only quickly added support and did a sanity test, so apologies if I didn't get it quite right. Will check back in after lunch.
This will improve performance but won't work for attach mode as is. So yes, we will need an event to generate them on the fly even when the job is already in
Sure, similar to sending a signal to a job, the event

Will that simple approach work?
Yes, this should work great. Thanks.
@dongahn, pushed another commit to my PR with support for event
@grondo: Now that I think about this, there would be one race condition with this on-demand proctable generation for attach mode.

It seems I will need a mechanism to detect that the proctable has been dumped in its entirety. A new key attribute like

This is not a problem for launch mode, because waiting for
Yeah, I thought of that too. The commit is done under a fence, so I think once the first procdesc key is in the kvs they are all guaranteed to be there. You should be able to put a WAITCREATE watch on the procdesc for task 0, then fetch all entries after that. We could also put

Another approach going forward might be to dump all proctable entries into a single kvs key, or keys for each task. We could do this easily by splitting the 3 procdesc entries for each task into their own key and using the aggregator module, e.g.

Then

Let me know what you think.
Yes, this sounds good. For PIDs, I wonder if delta encoding or range encoding would also be effective. If you group these PIDs per flux rank, the PIDs are likely to be close to one another (likely consecutive) and the list should compress pretty well with these encodings. Each PID group would be stored in compressed form with the flux rank as the key. Of course, we still have to map each MPI rank to the flux rank+pid pair, but I wonder if that can be reconstructed with an implicit rule.
It looks like this is a comma-delimited string, correct? (That's fine, I just want to double check.) More importantly, what will be the mechanism for the new execution system?
It is in hostlist format, e.g.:

```
grondo@fluke108:~$ salloc -N8
salloc: Granted job allocation 323
grondo@fluke43:~$ flux kvs get resource.hosts
"fluke[43-44,47-52]"
grondo@fluke43:~$
```
Hm, the rank-to-hostname mapping is separate from the execution service. I suppose we've removed support for
I see. I wasn't sure because I am using docker on OSX and was getting

```
ahn1@b279da58591f:/usr/src/src/cmd$ flux kvs get resource.hosts
"b279da58591f,b279da58591f"
```

since it has no basename. Do we have a library to deserialize hostlist, or are you on your own to parse it?
In v0.11 we have

which you could link with directly for now. We pulled this code out of flux as part of the wreck and Lua purge, so it isn't available in the upstream flux-core repo. We'll have to figure out some other format there.
@dongahn, to link with

One existing example in tree is

If that doesn't work, maybe it is better if we compile
Adding:

```diff
index 9a3b248f..ad28f7d1 100644
--- a/src/bindings/lua/Makefile.am
+++ b/src/bindings/lua/Makefile.am
@@ -54,7 +54,8 @@ check_LTLIBRARIES = \
 noinst_LTLIBRARIES = \
 	libfluxlua.la \
-	lalarm.la
+	lalarm.la \
+	libhostlist.la
 
 luamod_ldflags = \
 	-avoid-version -module -shared --disable-static \
@@ -123,6 +124,10 @@ lalarm_la_LDFLAGS = \
 lalarm_la_LIBADD = \
 	$(LUA_LIB)
 
+libhostlist_la_SOURCES = \
+	lua-hostlist/hostlist.c \
+	lua-hostlist/hostlist.h
+
```

@grondo: Do you want
Yes, if you can replace

Got it.
Hey @grondo and @SteVwonder: I am using

```c
#define DEBUG(fmt,...) do { \
    if (verbose) log_msg(fmt, ##__VA_ARGS__); \
} while (0)
```

I believe
OK. A sign of life! I was able to debug a 64-task MPI hello world using totalview! Obviously there is a lot more testing to do, but the initial result is encouraging.

BTW, it seems this will be a real bug hunt, as an MPI hello world program compiled with mpicc hung for this on quartz. This and the OpenMPI issues recently reported by @damora and @SteVwonder suggest that we will need more MPI testing coverage...
@grondo: Attach mode mostly worked but there is still a race between proctable generation and

```
quartz770{dahn}121: /g/g0/dahn/workspace/flux_tool/bin/flux job-debug --attach -v 16
flux-job-debug: resource.hosts (quartz[770,770])
97051
flux-job-debug: flux_event_publish_pack: wreck.16.proctable succeeded.
flux-job-debug: job state: running
flux-job-debug: nnodes (2), ntasks (64), ncores (64), ngpus (0)
flux-job-debug: totalview_jobid (16)
flux-job-debug: MPIR_proctable_size (64)
flux-job-debug: Rank (0): exec (/g/g0/dahn/workspace/flux_tool/bin/hw), pid (96830), nodeid (0)
flux-job-debug: flux_kvs_lookup_get_unpack for procdesc: No such file or directory
```

The code is here: flux-core-v0.11/src/cmd/flux-job-debug.c, line 226 in ee01195.

I used a WAITCREATE watch on rank 0 to break any race, and then a plain kvs lookup. I may be missing something here, though. @grondo: can you be a second pair of eyes? If there is indeed no guarantee from the producer, I can use a WAITCREATE watch for every task at the cost of some performance overhead.
Ok, I will take a look.
@dongahn, if I understand the code correctly, it appears you are calling

Since you initiate the lookups all at once, some of the non-rank-0 lookups are pretty much guaranteed to fail with ENOENT. I would either add the WAITCREATE flag to all lookups, or only issue the rank 0 lookup at first, and have the fulfillment of that future trigger the remainder of the lookups. I may not completely understand what is going on in
Ah, thanks. Yes, you are correct. I misunderstood how flux futures work; there is no ordering among async requests like this. Makes perfect sense for performance. I will hoist the rank 0 request out of the loop as a "blocking" WAITCREATE and move on to the main loop!
@grondo: For attach testing, I wanted to start up wreckrun in the background (e.g., the flux wreckrun -N 4 -n 64 ../../../bin/hw, Ctrl-Z, bg sequence). I'm getting

I initially thought this was because wreckrun still wants to access

still exhibit the issue.

For now you might want to use
#10. Is this fixed?
Yes, in my current PR branch. BTW, I can't reproduce the problem above with
Hmm. I am using tcsh on quartz. I saw this issue in my Mac OSX docker as well.

```
quartz770{dahn}24: flux wreckrun -i /dev/null -N 1 -n2 sleep 20 &
[2] 41666
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
[2] + Suspended (tty input) flux wreckrun -i /dev/null -N 1 -n2 sleep 20
```
Ah, yes, I can occasionally reproduce a problem if I try to run flux-wreckrun directly in the background. Maybe also try redirecting input of
I did some manual smoke tests with totalview, and the current version works great (w/o server bulk launch, of course). I've asked @lee218llnl to suggest a STAT installation whose configuration files I can modify for manual testing as well. Given the testing results so far, I will soon go ahead and add some sharness tests, with the goal of landing this PR sooner rather than later.
Well, I looked at the version and realized STAT/LaunchMON can't launch tool daemons without bulk launch capability. So I will work on sharness test cases first. Then, when my PR gets merged with @grondo's
Kind of a convoluted corner case: I started wreckrun with --options=stop-children-in-exec to see if I can attach totalview when the state is

I think this is probably

I tested this for the case where the initial job-debug session somehow failed and users want to attach to the job using another job-debug session. (This would be very, very rare.) I don't know if I want
Moving flux-core-v0.11 specific discussion here for flux-framework/flux-core#2163.
After a quick discussion with @dongahn, items required for parallel debugger support in v0.11 include:

- `flux job-debug` frontend command with support for totalview: `flux job-debug wreckrun ...` and `flux job-debug --attach=ID`
- `flux job-debug` is able to gather per-task PIDs and executables
- `--jobid` option for `wreckrun` to run 1 task per rank against an existing job (used for TV bulk launch)