debugger support #12

Closed · 4 tasks done
grondo opened this issue May 20, 2019 · 52 comments

grondo commented May 20, 2019

Moving flux-core v0.11-specific discussion here from flux-framework/flux-core#2163.

After quick discussion with @dongahn, items required for parallel debugger support in v0.11 include:

  • support for "sync" event and stop-in-exec support for wreck. (Nominally working but needs to be tested)
  • New flux job-debug frontend command with support for totalview flux job-debug wreckrun ... and flux job-debug --attach=ID (see the MPIR interface sketch below)
  • Resurrect per-task "proctable" support in wreckrun so that flux job-debug is able to gather per-task PIDs and executables
  • New --jobid option for wreckrun to run 1 task per rank against an existing job (used for TV bulk launch)
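
For reference, below is a minimal sketch of the standard MPIR process-acquisition symbols a front end such as flux job-debug would expose for tools like TotalView. It follows the MPIR interface specification and is purely illustrative, not the actual flux-job-debug source.

    /* Standard MPIR process-acquisition symbols (per the MPIR spec). */
    typedef struct {
        char *host_name;        /* host where the task runs */
        char *executable_name;  /* path to the task's executable */
        int pid;                /* task's local pid */
    } MPIR_PROCDESC;

    MPIR_PROCDESC *MPIR_proctable = NULL;  /* one entry per MPI rank */
    int MPIR_proctable_size = 0;
    int MPIR_being_debugged = 0;
    int MPIR_debug_state = 0;

    /* The debugger plants a breakpoint here; the front end calls it after
     * filling in MPIR_proctable and updating MPIR_debug_state. */
    void MPIR_Breakpoint (void)
    {
    }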

grondo commented May 20, 2019

The "proctable" support for wreckrun was removed in 0d547e2

Perhaps we can resurrect this feature, but only enable it as needed?


dongahn commented May 20, 2019

Thanks @grondo! Just to set the expectation, this work will only target "being able to debug a job once the target flux instance is identified by the user". IOW, the work won't add those features that allow users to easily debug a job through many nested instances. This work on the v0.11 fork will later inform full support for the new execution system.

  1. Use flux job-debug for launch mode and flux job-debug --attach for attached mode. One thing I will need to check early on is whether totalview --args flux job-debug... will seamlessly work for totalview's restarting of a debug session.

  2. Gather proctable into flux-job-debug from wreck subsystem. Thanks @grondo for the past commit.

  3. Introduce --jobid=<jobid> option to wreckrun which totalview can use to bulk launch its debugger daemons.

  4. Will use an event to tell wreck to send SIGCONT to each and every MPI process to clear out of the debugger barrier.

  5. Won't worry about how to avoid symbol stripping and such. This will be revisited for RPM as a separate work item.

Finally, there will be a whole bunch of adjustments on the TotalView side as well.


dongahn commented May 20, 2019

@grondo: now that I think about this, I may also need flux-wreckrun --jobid-fifo=FIFOPATH or something similar. flux-job-debug will need to fork() and exec() flux-wreckrun and then need to be synchronized with flux-wreckrun. But it is not clear how to do this atomically. For example, the race-free jobid can only come from the job create RPC performed by wreckrun. flux-job-debug could create a FIFO and then send its path to wreckrun, which would then return its jobid via the FIFO. Once the jobid is returned, I suspect the rest of the synchronization, such as waiting until all information including PIDs is filled in, can be done through the KVS?


dongahn commented May 20, 2019

@grondo: Also what is the easiest way to strip out all the flux-job-debug options from optparse? This is so that I can fork() and exec() the remaining arguments, which are the wreckrun command line.


grondo commented May 20, 2019

@grondo: Also what is the easiest way to strip out all the flux-job-debug options from optparse?

See the examples in flux-start.c and/or flux-exec.c, each of which creates a new argc/argv from the remaining arguments after all option arguments have been parsed. E.g.:

    if ((optindex = optparse_parse_args (opts, argc, argv)) < 0)
        exit (1);
    /* optindex is index of the first non-option argument:
     *  i.e. new argc, argv = argc-optindex, argv+optindex
     */


grondo commented May 20, 2019

flux-job-debug will need to fork() and exec() flux-wreckrun and then need to be synchronized with flux-wreckrun.

Yeah, either a fifo or an open FD would work here. Instead of a new option, a perhaps simpler approach would be a FLUX_WRECKRUN_JOBID_FD environment variable. If present, wreckrun would emit the obtained jobid over this file descriptor.

Once the jobid is returned, I suspect the rest of the synchronization, such as waiting until all information including PIDs is filled in, can be done through the KVS?

Yes, I think you'd want to set a watch on the job state and once the job was in the appropriate state then all procdesc information is guaranteed to be live in the kvs.
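
A minimal sketch of the file-descriptor handoff described above, from the job-debug side: create a pipe, export the write end's number in FLUX_WRECKRUN_JOBID_FD, fork/exec wreckrun, and read the jobid back. Note that both the variable name and the wreckrun-side behavior are the proposal in this comment, not an existing interface.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Spawn "flux wreckrun ..." (argv) and return the jobid it reports
     * over the proposed FLUX_WRECKRUN_JOBID_FD file descriptor. */
    static long spawn_wreckrun (char **argv)
    {
        int pfd[2];
        char fdstr[16], buf[64];
        ssize_t n;
        pid_t pid;

        if (pipe (pfd) < 0)
            exit (1);
        snprintf (fdstr, sizeof (fdstr), "%d", pfd[1]);
        if ((pid = fork ()) == 0) {             /* child: exec wreckrun */
            close (pfd[0]);
            setenv ("FLUX_WRECKRUN_JOBID_FD", fdstr, 1);
            execvp (argv[0], argv);
            _exit (127);
        }
        close (pfd[1]);                         /* parent: wait for jobid */
        if ((n = read (pfd[0], buf, sizeof (buf) - 1)) <= 0)
            exit (1);
        buf[n] = '\0';
        close (pfd[0]);
        return strtol (buf, NULL, 10);
    }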


dongahn commented May 20, 2019

a perhaps simpler approach would be a FLUX_WRECKRUN_JOBID_FD environment variable.

Yeah! This will surely be much less disruptive. Thanks.

Yes, I think you'd want to set a watch on the job state and once the job was in the appropriate state then all procdesc information is guaranteed to be live in the kvs.

Agreed.


dongahn commented May 22, 2019

@grondo:

One more issue.

The MPIR_PROCDESC is defined in the MPIR debug API spec:

typedef struct {
    char *host_name;
    char *executable_name;
    int pid;
} MPIR_PROCDESC;

So I will need to fetch the hostname and executable name for each and every process. But this (in particular the hostname) is not part of the kvs schema, correct? How do you suggest we proceed?


grondo commented May 22, 2019

The rank id to hostname mapping can be found in the resource.hosts kvs key.

(This is for v0.11 only)


grondo commented May 22, 2019

BTW, I noticed that the code in commit 0d547e2 to create the per-task "procdesc" was old enough that it used json-c. So unfortunately the old code isn't as useful as I'd hoped.


dongahn commented May 22, 2019

The rank id to hostname mapping can be found in the resource.hosts kvs key.

OK. Let's see if I understand. We need to reconstruct MPIR_proctable so that it has an MPIR_PROCDESC entry for each and every MPI process, in MPI_COMM_WORLD rank order. The only per-process info I can find in the current schema is:

lwj.0.0.13.0.stdin -> lwj.0.0.13.input.files.stdin
lwj.0.0.13.1.stdin -> lwj.0.0.13.input.files.stdin
lwj.0.0.13.2.stdin -> lwj.0.0.13.input.files.stdin
lwj.0.0.13.3.stdin -> lwj.0.0.13.input.files.stdin

Presumably 0, 1, 2, and 3 here are the MPI processes' MPI_COMM_WORLD ranks? Are they KVS directories? If so, we will need to put additional metadata under these: flux ranks and UNIX pids?

Once that's there, job-debug can fetch the flux ranks and resolve them into hostnames by looking up the resource.hosts?


grondo commented May 22, 2019

Take a look at 0d547e2. We removed it because there were no users, but this code added a key to every task called "procdesc" which contains:

{ "command": "executable name",
   "pid": local_pid,
   "nodeid": node_rank
}

This is not scalable, but it is easy to start here and improve later.
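
For illustration, a sketch of how flux job-debug might read one task's procdesc entry in this format. The key layout (lwj.<path>.<taskid>.procdesc) and field names follow this thread; the exact flux_kvs_lookup signature varies across flux-core versions (newer releases take an extra namespace argument), so treat the calls shown as an assumption rather than the final code.

    #include <stdio.h>
    #include <flux/core.h>

    /* Read and print one task's procdesc entry from the job's kvs directory. */
    static int read_procdesc (flux_t *h, const char *jobpath, int taskid)
    {
        char key[256];
        const char *command;
        int pid, nodeid;
        flux_future_t *f;

        snprintf (key, sizeof (key), "%s.%d.procdesc", jobpath, taskid);
        if (!(f = flux_kvs_lookup (h, 0, key)))
            return -1;
        if (flux_kvs_lookup_get_unpack (f, "{s:s s:i s:i}",
                                        "command", &command,
                                        "pid", &pid,
                                        "nodeid", &nodeid) < 0) {
            flux_future_destroy (f);
            return -1;
        }
        printf ("task %d: exec=%s pid=%d nodeid=%d\n", taskid, command, pid, nodeid);
        flux_future_destroy (f);
        return 0;
    }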


dongahn commented May 22, 2019

Oh good. I think we can get a first level of compression by putting the executable names into an array somewhere and referencing them from within each per-rank procdesc (in the same way as the node_rank to hostname lookup).


dongahn commented May 23, 2019

support for "sync" event and stop-in-exec support for wreck. (Nominally working but needs to be tested)

Ok, my testing also shows this works fine.

flux-job-debug passes --options=stop-children-in-exec to wreckrun, so wreck starts the remote processes in a stopped state. I think all I need at this point is to "safely" wait until the job state becomes sync, at which point I can gather the proctable. BTW, my progress will be a bit slower over the next few days as I have to work on some other urgent issues.
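
For what it's worth, here is a naive sketch of that wait, polling the job's state key until it reads "sync". The key name (lwj.<path>.state), its encoding as a JSON string, and the flux_kvs_lookup signature are assumptions based on this thread; a real implementation would use a KVS watch rather than polling.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <flux/core.h>

    /* Poll the job's state key until it reaches "sync". */
    static int wait_for_sync (flux_t *h, const char *jobpath)
    {
        char key[256];
        const char *state;
        flux_future_t *f;
        int done;

        snprintf (key, sizeof (key), "%s.state", jobpath);
        for (;;) {
            if (!(f = flux_kvs_lookup (h, 0, key)))
                return -1;
            if (flux_kvs_lookup_get_unpack (f, "s", &state) < 0) {
                flux_future_destroy (f);
                return -1;
            }
            done = !strcmp (state, "sync");
            flux_future_destroy (f);
            if (done)
                return 0;
            usleep (100 * 1000);    /* 100 ms poll interval */
        }
    }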


dongahn commented May 25, 2019

New flux job-debug frontend command with support for totalview flux job-debug wreckrun ... and flux job-debug --attach=ID

@grondo: Just to coordinate, I am pretty far along with this front end. In fact, once wreck starts generating procdescs, I should be able to start testing my front end with our debuggers. Have a good long weekend!


dongahn commented May 25, 2019

One question though: my flux-job-debug front end currently inserts --options=stop-children-in-exec into the wreckrun command line for launch mode. Would this have any side effect if the user also specified other arguments to --options? If so, is there another mechanism to effect the sync operation within wreck without incurring any side effects?


grondo commented May 28, 2019

Would this have any side effect if the user also specified other arguments to --options?

I don't think so, you should be able to specify --options multiple times, so stop-children-in-exec should just be added to lwj.x.y.z.options. To test, you could try running a job under flux job-debug that uses -o kz and ensure that kz and stop-children-in-exec both appear in the options object in the kvs dir for the job.


grondo commented May 28, 2019

In fact, once wreck starts generating procdescs, I should be able to start testing my front end with our debuggers.

Ok, should I go ahead and add lwj.<task>.procdesc back to wrexecd in the same format as we had before?


dongahn commented May 28, 2019

Ok, should I go ahead and add lwj.<task>.procdesc back to wrexecd in the same format as we had before?

Yes, let's go with that route first and use the lesson learned to optimize for the new execution system.

When you have this, could you quickly push it to your existing PR so that I can cherry pick? I know you may want to add test cases etc., but having the commit early will help me make progress with my portion as well. Thanks @grondo!


dongahn commented May 28, 2019

I don't think so, you should be able to specify --options multiple times, so stop-children-in-exec should just be added to lwj.x.y.z.options. To test, you could try running a job under flux job-debug that uses -o kz and ensure that kz and stop-children-in-exec both appear in the options object in the kvs dir for the job

Great. I will test this. Thanks!


grondo commented May 28, 2019

When you have this, could you quickly push it to your existing PR so that I can cherry pick?

Ok, pushed! Note: the procdesc entry will only be created when stop-children-in-exec option is used (for now). Perhaps we can add support for an event that will trigger wrexecd to generate these entries on demand as well.

I only quickly added support and did a sanity test, so apologies if I didn't get it quite right. Will check back in after lunch.


dongahn commented May 28, 2019

Ok, pushed! Note: the procdesc entry will only be created when stop-children-in-exec option is used (for now). Perhaps we can add support for an event that will trigger wrexecd to generate these entries on demand as well.

This will improve performance but won't work for attach mode as is. So yes, we will need an event to generate them on the fly even when the job is already in the running state. Can you propose such an event -- one that wreck and job-debug can agree on?


grondo commented May 28, 2019

we will need an event to generate them on the fly even when the job is already in the running state. Can you propose such an event -- one that wreck and job-debug can agree on?

Sure, similar to sending a signal to a job, the event wrexec.<jobid>.proctable will cause the wrexecds for job jobid to dump proctable entries for every task to the kvs.

Will that simple approach work?


dongahn commented May 28, 2019

Sure, similar to sending a signal to a job, the event wrexec.<jobid>.proctable will cause the wrexecds for job jobid to dump proctable entries for every task to the kvs.

Will that simple approach work?

Yes, this should work great. Thanks.


grondo commented May 28, 2019

@dongahn, pushed another commit to my PR with support for event wreck.<jobid>.proctable (sorry, I forgot the prefix was wreck. not wrexec.)
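
A sketch of what the job-debug side of this could look like: publish the wreck.<jobid>.proctable event and wait for the publish to be confirmed. Publishing with an empty payload is an assumption here; the actual front end ended up using flux_event_publish_pack (see the log output later in this thread).

    #include <stdio.h>
    #include <stdint.h>
    #include <flux/core.h>

    /* Ask the wrexecds for a running job to dump proctable entries. */
    static int request_proctable (flux_t *h, int64_t jobid)
    {
        char topic[64];
        flux_future_t *f;

        snprintf (topic, sizeof (topic), "wreck.%ju.proctable", (uintmax_t)jobid);
        if (!(f = flux_event_publish (h, topic, 0, NULL)))
            return -1;
        if (flux_future_get (f, NULL) < 0) {    /* wait for publish confirmation */
            flux_future_destroy (f);
            return -1;
        }
        flux_future_destroy (f);
        return 0;
    }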


dongahn commented May 29, 2019

@dongahn, pushed another commit to my PR with support for event wreck.<jobid>.proctable (sorry, I forgot the prefix was wreck. not wrexec.)

@grondo: Thanks grondo. Yes, I understood it that way given the single event name.


dongahn commented May 29, 2019

@grondo: Now that I think about this, there would be one race condition with this on-demand proctable generation for attach mode.

flux-job-debug will send wreck.<jobid>.proctable and then start to fetch the MPIR info. But there is no guarantee that the proctable will be filled in by the time job-debug fetches this attribute.

It seems I will need a mechanism to detect that the proctable has been dumped in its entirety. A new key like proctable=dumped on which I can do a "watch"? Or I could just use a watch-once type lookup to fetch each and every proctable entry. Not sure whether this will have performance issues, though.

This is not a problem for launch mode, because waiting for state=sync creates a natural synchronization.


grondo commented May 29, 2019

Yeah, I thought of that too. The commit is done under a fence so I think once the first procdesc key is in the kvs they are all guaranteed to be there. You should be able to put a WAITCREATE watch on the procdesc for task 0, then fetch all entries after that.

We could also put proctable=dumped under that same fence, but it seems redundant.

Another approach going forward might be to dump all proctable entries into a single kvs key, or a few keys covering all tasks. We could do this easily by splitting the 3 procdesc fields for each task into their own keys and using the aggregator module, e.g.

proctable.command = {"[0-32]": "hostname"}
proctable.noderank = {"[0-16]": 0, "[17-31]": 1}
proctable.pid = {"0": 1234,
                 "1": 5678, ... }

Then flux job-debug would only need to read 3 keys instead of one key per task. The proctable.pid key might grow very large though (even though there is some probability of processes across large jobs sharing pids).

Let me know what you think.


dongahn commented May 29, 2019

Yes, this sounds good. For PIDs, I wonder if delta encoding or range encoding would also still be effective. If you group the PIDs per flux rank, the PIDs would likely be close to one another (likely consecutive) and the list should compress pretty well with these encodings. Each group of PIDs would be stored in compressed form with the flux rank as the key. Of course, we still have to map each MPI rank to a flux rank + pid pair, but I wonder if that can be reconstructed with an implicit rule.


dongahn commented May 29, 2019

The rank id to hostname mapping can be found in the resource.hosts kvs key.
(This is for v0.11 only)

It looks like this is a comma delimited string, correct? (That's fine. I just want to double check.) More importantly, what will be the mechanism for the new execution system?


grondo commented May 29, 2019

It looks like this is a comma delimited string, correct? (That's fine. I just want to double check.) More importantly, what will be the mechanism for the new execution system?

It is in hostlist format, e.g.:

grondo@fluke108:~$ salloc -N8
salloc: Granted job allocation 323
grondo@fluke43:~$ flux kvs get resource.hosts
"fluke[43-44,47-52]"
grondo@fluke43:~$

More importantly, what will be the mechanism for the new execution system?

Hm, the rank-to-hostname mapping is separate from the execution service. I suppose we've removed support for resource.hosts in the flux-core upstream repo so we'll have to come up with some other way to store this mapping.


dongahn commented May 29, 2019

It is in hostlist format

I see. I wasn't sure because I am using docker on macOS and was getting

ahn1@b279da58591f:/usr/src/src/cmd$ flux kvs get resource.hosts
"b279da58591f,b279da58591f"

since it has no basename.

Do we have a library to deserialize hostlist, or are you on your own to parse it?


grondo commented May 29, 2019

In v0.11 we have

src/bindings/lua/lua-hostlist/hostlist.h
src/bindings/lua/lua-hostlist/hostlist.c

which you could link against directly for now. We pulled this code out of flux as part of the wreck and Lua purge, so it isn't available in the upstream flux-core repo. We'll have to figure out some other format there.
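
For illustration, a sketch of resolving a node rank to a hostname by expanding the resource.hosts string with that library. The pdsh-style API names (hostlist_create, hostlist_nth, hostlist_destroy) and the caller-frees convention are assumptions about lua-hostlist/hostlist.h.

    #include <stdlib.h>
    #include "hostlist.h"

    /* Expand e.g. "fluke[43-44,47-52]" and return the noderank-th hostname. */
    static char *noderank_to_hostname (const char *hosts, int noderank)
    {
        hostlist_t hl;
        char *host;

        if (!(hl = hostlist_create (hosts)))
            return NULL;
        host = hostlist_nth (hl, noderank);   /* caller frees (assumption) */
        hostlist_destroy (hl);
        return host;
    }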


grondo commented May 30, 2019

@dongahn, to link with lua-hostlist/hostlist.c you may be able to list the full path to the .lo directly in the Makefile, e.g.

flux_job_debug_SOURCES = $(top_builddir)/src/bindings/lua/lua-hostlist/hostlist.lo

One existing example in tree is libflux_idset_la_SOURCES in src/common/Makefile.am,
though this is for a library not an executable, so not a great example.

If that doesn't work, maybe it is better if we compile hostlist.[ch] into a libhostlist.la convenience library, which will make linking more straightforward. (I can do that for you on my PR branch).


dongahn commented May 30, 2019

If that doesn't work, maybe it is better if we compile hostlist.[ch] into a libhostlist.la convenience library, which will make linking more straightforward. (I can do that for you on my PR branch).

Adding hostlist.lo didn't quite work, so I just created the libhostlist.la convenience library in my branch, which worked:

index 9a3b248f..ad28f7d1 100644
--- a/src/bindings/lua/Makefile.am
+++ b/src/bindings/lua/Makefile.am
@@ -54,7 +54,8 @@ check_LTLIBRARIES = \

 noinst_LTLIBRARIES = \
 	libfluxlua.la \
-	lalarm.la
+	lalarm.la \
+	libhostlist.la

 luamod_ldflags = \
 	-avoid-version -module -shared --disable-static \
@@ -123,6 +124,10 @@ lalarm_la_LDFLAGS = \
 lalarm_la_LIBADD = \
 	$(LUA_LIB)

+libhostlist_la_SOURCES = \
+	lua-hostlist/hostlist.c \
+	lua-hostlist/hostlist.h
+

@grondo: Do you want flux_hostlist_la to rely on this convenience library as well? Or just keep it as is?


grondo commented May 30, 2019

Yes, if you can replace lua-hostlist/hostlist.[ch] in flux_hostlist_la_SOURCES with libhostlist.la that would be cleaner I think.


dongahn commented May 30, 2019

Got it.


dongahn commented May 30, 2019

Hey @grondo and @SteVwonder: I am using log_msg to print debug messages for flux-job-debug, like

#define DEBUG(fmt,...) do { \
      if (verbose) log_msg(fmt, ##__VA_ARGS__); \
} while (0)

I believe flux-job-debug is one of those commands for which printing a timestamp on each line will be very useful. Is there a variant of log_msg that will add a timestamp? Otherwise, I will add my own timestamp as a prefix to fmt.
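
If there is no built-in variant, a possible timestamped version of the macro above could look like the following; it simply prefixes fmt with wall-clock time before handing off to log_msg (verbose and log_msg come from the surrounding flux-job-debug code).

    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>

    /* DEBUG with an HH:MM:SS.usec prefix on every line. */
    #define TDEBUG(fmt, ...) do {                                        \
            if (verbose) {                                               \
                struct timeval tv;                                       \
                struct tm tm;                                            \
                char ts[32];                                             \
                gettimeofday (&tv, NULL);                                \
                localtime_r (&tv.tv_sec, &tm);                           \
                strftime (ts, sizeof (ts), "%H:%M:%S", &tm);             \
                log_msg ("%s.%06ld: " fmt, ts, (long)tv.tv_usec,         \
                         ##__VA_ARGS__);                                 \
            }                                                            \
    } while (0)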


dongahn commented May 31, 2019

OK. A sign of life! I was able to debug 64 MPI hello world using totalview! Obviously a lot more testing to do but the initial result is encouraging.

[Screenshot: TotalView debugging the 64-task MPI hello world job]

BTW, it seems this will be real bug hunting, as an MPI hello world program compiled with mpicc hung for this on quartz. This and the OpenMPI issues recently reported by @damora and @SteVwonder suggest that we will need more MPI testing coverage...


dongahn commented May 31, 2019

@grondo: Attach mode mostly worked but there still is a race between proctable generation and job-debug fetching the proctable.

quartz770{dahn}121: /g/g0/dahn/workspace/flux_tool/bin/flux job-debug --attach -v 16
flux-job-debug: resource.hosts (quartz[770,770])
97051
flux-job-debug: flux_event_publish_pack: wreck.16.proctable succeeded.
flux-job-debug: job state: running
flux-job-debug: nnodes (2), ntasks (64), ncores (64), ngpus (0)
flux-job-debug: totalview_jobid (16)
flux-job-debug: MPIR_proctable_size (64)
flux-job-debug: Rank (0): exec (/g/g0/dahn/workspace/flux_tool/bin/hw), pid (96830), nodeid (0)
flux-job-debug: flux_kvs_lookup_get_unpack for procdesc: No such file or directory

The code is here:

log_err_exit ("flux_kvs_lookup_get_unpack for procdesc");

I used a WAITCREATE watch on rank 0 to break any race and then plain kvs lookups. I may be missing something here, though. @grondo: can you be a second pair of eyes?

If there is indeed no guarantee from the producer, I can use a WAITCREATE watch for every task at the cost of performance overhead.


grondo commented May 31, 2019

Ok, I will take a look.


grondo commented May 31, 2019

@dongahn, if I understand the code correctly it appears you are calling flux_kvs_lookup() on all lwj.x.y.z.<taskid>.procdesc KVS keys in one loop, but only applying FLUX_KVS_WAITCREATE to the lookup for taskid == 0.

Since you initiate the lookups all at once, some of the non-rank-0 lookups are pretty much guaranteed to fail with ENOENT.

I would either add the WAITCREATE flag to all lookups, or only issue the rank 0 lookup at first, and have the fulfillment of that future trigger the remainder of the lookups.

I may not completely understand what is going on in fill_mpir_proctable() though, so please correct me if I misread something.


dongahn commented May 31, 2019

Ah, thanks. Yes, you are correct. I misunderstood how flux futures work; there is no ordering among async requests like this. Makes perfect sense for performance.

I will hoist the rank 0 request out of the loop as a "blocking" WAITCREATE and move on to the main loop!
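
A sketch of that pattern: block on a FLUX_KVS_WAITCREATE lookup for task 0's procdesc (since the commit is done under a fence, once task 0's key exists they all do), then issue plain lookups for the rest. As before, the flux_kvs_lookup signature shown may differ between flux-core versions, and read_procdesc() refers to the earlier sketch in this thread.

    #include <stdio.h>
    #include <flux/core.h>

    static int read_procdesc (flux_t *h, const char *jobpath, int taskid); /* earlier sketch */

    /* Wait for task 0's procdesc to appear, then fetch every task's entry. */
    static int wait_then_fetch (flux_t *h, const char *jobpath, int ntasks)
    {
        char key[256];
        const char *json_str;
        flux_future_t *f;
        int i;

        snprintf (key, sizeof (key), "%s.0.procdesc", jobpath);
        if (!(f = flux_kvs_lookup (h, FLUX_KVS_WAITCREATE, key))
            || flux_kvs_lookup_get (f, &json_str) < 0) {  /* blocks until created */
            flux_future_destroy (f);
            return -1;
        }
        flux_future_destroy (f);

        for (i = 0; i < ntasks; i++) {        /* plain lookups are now safe */
            if (read_procdesc (h, jobpath, i) < 0)
                return -1;
        }
        return 0;
    }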


dongahn commented May 31, 2019

@grondo: For attach testing, I wanted to start up wreckrun in the background (e.g., flux wreckrun -N 4 -n 64 ../../../bin/hw, Ctrl-Z, bg sequence). I'm getting

[1]  + Suspended (tty input)         flux wreckrun -N 4 -n 64 ../../../bin/hw

I initially thought this was because wreckrun still wants to access stdin from the terminal. But

flux wreckrun --input=/dev/null:all -N 4 -n 64 ../../../bin/hw &

still exhibits the issue.


grondo commented May 31, 2019

For now you might want to use -d, --detach option. (though ctrl-z suspend with --input=/dev/null should work, I'll look at that quickly)


dongahn commented May 31, 2019

For now you might want to use -d, --detach option.

#10. This fixed?


grondo commented May 31, 2019

#10. This fixed?

Yes, in my current PR branch.

BTW, I can't reproduce the problem above with ctrl-z; bg. I'm even able to background a wreckrun without -i /dev/null. I wonder what is the difference?


dongahn commented May 31, 2019

Hmm. I am using tcsh on quartz. I saw this issue in my docker setup on macOS as well.

quartz770{dahn}24: flux wreckrun -i /dev/null -N 1 -n2 sleep 20 &
[2] 41666
quartz770{dahn}25:
quartz770{dahn}25:
quartz770{dahn}25:
[2]  + Suspended (tty input)         flux wreckrun -i /dev/null -N 1 -n2 sleep 20


grondo commented May 31, 2019

Ah, yes, I can occasionally reproduce a problem if I try to run flux-wreckrun directly in the background. Maybe also try redirecting input of flux-wreckrun from /dev/null, e.g. flux wreckrun -i /dev/null -N1 -n2 sleep 20 < /dev/null


dongahn commented May 31, 2019

I did some manual smoke tests with totalview and the current version works great (w/o server bulk launch of course). I've asked @lee218llnl to suggest a STAT installation for which I can modify the configuration files for manual testing as well. Given the testing results I got so far, I will soon go ahead and add some sharness tests with a goal to land this PR sooner rather than later.


dongahn commented May 31, 2019

I've asked @lee218llnl to suggest a STAT installation for which I can modify the configuration files for manual testing as well.

Well, I looked at the version and realized STAT/LaunchMON can't launch tool daemons without the bulk launch capability. So I will work on the sharness test cases first. Then when my PR gets merged along with @grondo's wreckrun --jobid support, I will test STAT (as well as totalview's server bulk launch).


dongahn commented May 31, 2019

@grondo:

Kind of a convoluted corner case:

I start wreckrun with --options=stop-children-in-exec to see if I can attach totalview when the state is sync.

flux-wreckrun --options=stop-children-in-exec -N 2 -n 32 ./hw &

job-debug allows totalview to attach to the job in the sync state and unlock the processes from the initial barrier. But after this happens, wreck still reports the job state as sync.

quartz31{dahn}47: flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND/NAME
     3      2 sync       2019-05-31T14:01:41       2.118m    [0-1] hw

I think this is probably because job-debug didn't send the SIGCONT signal.

I tested this for the case where the initial job-debug session somehow failed and users want to attach to this job using another job-debug session. (Would be very very rare).

I don't know if I want job-debug to send the extra SIGCONT signal to wreck for this case. Not sure if that will be safe.

dongahn closed this as completed Jun 6, 2019