error launching/attaching LaunchMON debugger with OpenMPI 2.1.1 #3660
FWIW, I can launch/attach via LaunchMON with OpenMPI 2.0.3.
I won't be able to get to this right away - I'm surprised this is still using the usock component, though. I'll have to check what version of PMIx is being used. So far as I can tell, all the PRs were taken. However, it is possible that I forgot to file one for the 2.1 series. I'll check that too. Be advised that I plan to gradually move to supporting only PMIx attach methods as we go forward. Not sure if someone else will pick up the non-PMIx methods. Won't be until after 3.0, though, so nothing imminent - just something to plan for the future.
Hmmm...well, the code supporting the debuggers has clearly not been updated. Not entirely sure why - probably my fault (most likely I forgot to file one, I suppose). Anyway, I'll create a patch.
See the referenced PR for the fix - please let us know if it resolves the problem. I tested it with both launch_1 and attach_1.
This looks good to me, thanks!
This apparently isn't fully fixed yet, according to folks at Allinea:
I don't have time to deal with this one, so I'm assigning it to others.
One further piece of info:
@dirk-schubert-arm (Dirk Schubert, DDT developer) Just to confirm: the changes we introduced to the Open MPI master branch in PR #3709 did not fix the problem, and you guys are seeing the problem @rhc54 cited above with the v2.1.2rc tarball (i.e., #3660 (comment) -- based on an email exchange with Xavier). Can you guys test the Open MPI development master to see if the problem occurs there? If so, it may be something we neglected to back-port to our v2.1.x release series. You can find nightly snapshot tarballs from master here: https://www.open-mpi.org/nightly/master/
Just to highlight a line from the original report:
Correct.
@jsquyres: How urgent is that? Can this wait until Xavier is back next week? If not, please let me know and I will try to have a look myself.
From Xavier in our ticket:
But not sure how much "a lot" is.
Yeah, as @dirk-schubert-arm and @rhc54 pointed out -- I missed that the original report says that it works fine on v3.0.x. It probably also works on master, but it would be good to verify (because we're going to fork from master again soon for v3.1.x). I guess we'll need some help tracking this down in the v2.1.x release series. This can certainly wait until next week.
Hello,
This is the backtrace from the mpirun core file:
I also tested the v3.0.x branch and it works fine:
The error that I see on master is different from the error reported for OpenMPI 2.1.x.
I finally found some time to make a reproducer just with GDB (attached in gdb-only.zip). The good news is that only version 2.1.x seems to be affected by this issue. Version 3.0.0 and master (using the nightly build from the 8th of October) are fine (I have not tested 3.1.x). The gist of the test: the script "run.sh" runs mpirun in a loop (configurable, default is 100) under GDB (driven by the commands defined in "mpirun.gdb"), counting the number of successful/bad runs and keeping a GDB log of all the runs.
Here is how you can run "run.sh":
Afterwards you can inspect the individual GDB logs with:
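The attached scripts themselves are not reproduced in this copy of the thread. A minimal sketch of the approach described above (loop mpirun under GDB, count good/bad runs, keep one GDB log per run) could look like the following; the file contents, the ./a.out test binary, and the "exited normally" success check are assumptions, not the actual attachment:

```shell
#!/bin/sh
# Sketch of the reproducer described above. The loop-under-GDB structure comes
# from the comment; the file contents, the ./a.out test binary, and the
# "exited normally" success check are assumptions.
RUNS=100                 # default iteration count, 100 as in the comment
LOGDIR=gdb-logs
mkdir -p "$LOGDIR"

# GDB command file that drives each mpirun invocation.
cat > mpirun.gdb <<'EOF'
set pagination off
run
quit
EOF

good=0; bad=0
i=1
while [ "$i" -le "$RUNS" ]; do
    log="$LOGDIR/run-$i.log"
    gdb --batch -x mpirun.gdb --args mpirun -n 2 ./a.out > "$log" 2>&1
    if grep -q 'exited normally' "$log"; then
        good=$((good + 1))
    else
        bad=$((bad + 1))
    fi
    i=$((i + 1))
done
echo "good=$good bad=$bad (one GDB log per run in $LOGDIR/)"
```

Inspecting an individual failed run is then just a matter of opening the corresponding gdb-logs/run-N.log.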
NB: I am not sure if the mpirun issue is the problem or just the symptom of a problem in the individual MPI processes. If I run the reproducer enough, the individual MPI processes can crash. FWIW:
I hope this helps. Let me know if we can assist further. Regards,
Can we close this issue as fixed in the v3.x series?
Are there any plans to fix this in a 2.x version?
@lee218llnl We talked about this today on the weekly teleconf. @hjelmn and @hppritcha are going to investigate and scope the issue. Just for our information: is it possible to upgrade to 3.0.x?
While it would be nice to see a fix in a 2.x version, I suppose a 3.0.x version will suffice. Do you think it will make it in time for x=1? If so, when is that going to be released?
v3.0.1 is anticipated by the end of the month. But the testing above shows that this is working with v3.0.0, v3.1.x, and master.
Agreeing with @lee218llnl that a fix would be nice in 2.x, but 3.0.x will do. My only worry is users of OpenMPI 2.1.x who are not able to upgrade (immediately) to 3.0.x for whatever reason - in the end they are not able to debug/profile their code or (more generally) make use of any tool relying on the MPIR interface.
My understanding is that it's not the entire MPIR interface that's broken, but just the part of MPIR that LaunchMON uses to attach to MPI processes. Other tools that launch via MPIR seem to be working. It may be easier to look at what might have regressed since 2.0.x and fix it that way, if anyone has time to look into this.
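For context, the launch side of MPIR that reportedly still works can be exercised directly from GDB: the tool starts mpirun under the debugger, breaks in MPIR_Breakpoint, and reads MPIR_proctable once mpirun has published it. A sketch follows; the file name and mpirun command line are assumptions, while MPIR_Breakpoint and MPIR_proctable come from the standard MPIR Process Acquisition Interface:

```shell
# Sketch of an MPIR-style launch, written as a GDB command file so the steps
# are explicit. The file name and mpirun arguments are assumptions;
# MPIR_Breakpoint and MPIR_proctable are the standard MPIR symbols.
cat > mpir-launch.gdb <<'EOF'
# Stop when mpirun signals that the process table is ready.
break MPIR_Breakpoint
run
# At this point a tool reads the table of launched processes.
print MPIR_proctable_size
print MPIR_proctable[0]
EOF
# To try it (not run here):
#   gdb --batch -x mpir-launch.gdb --args mpirun -n 2 ./a.out
```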
It's not just LaunchMON, because our tools - Arm DDT/MAP (formerly known as Allinea DDT/MAP) - have the same problem (based on my understanding) with Open MPI 2.1.x.
Okay. Was the GDB standalone reproducer that I provided on Oct 20th of any help?
I'm always intrigued by the different ways people approach a break in code. My approach would be to simply add some print statements to find where the current code is broken, and fix it. IMO, that is much easier than playing all these "how did the code change" games.

Hopefully, folks are now beginning to understand better the community decision to move away from MPIR. I'm disturbed that six months into the transition of RTE responsibilities, we still can't get someone to address problems such as this one...but that's a community problem that shouldn't get reflected into the broader user base.

FWIW: I recently committed a PMIx-based RTE component into OMPI that eliminates the need for a runtime (except, of course, where no managed environment exists and the PMIx reference server isn't being used). This will significantly reduce the RTE support requirement for most installations. @npe9 and I need to do some testing/debugging, but it should be ready soon.

I'll take a crack at fixing this today as my LANL friends tell me that they would appreciate support for the 2.1 series, and I hate seeing an entire OMPI series that doesn't support a debugger. I suspect it isn't a big issue. However, it truly is the last time I can/will do it.

@lee218llnl If you would like, I'm happy to help integrate LaunchMON with the PMIx debugger support.
I don't work directly with LaunchMON; I'm more of a client of it with STAT. @dongahn may want to chime in on this.
@rhc54: this would be an interesting effort. The key would be to maintain both MPIR and PMIx ports. Probably doable, but it would need a redesign effort, which I don't have time for at this point. Maybe I can get some help from Matt Legendre's team (I don't have his GitHub ID).
adding @mplegendre
@lee218llnl I just ran test.launch_1 using the current head of the v2.x branch and it worked fine, so perhaps this is already fixed. I don't see a 2.1.3 release candidate out there, but can you clone the OMPI v2.x branch and give it a try?
@rhc54 I just tested the v2.x branch and it does work for LaunchMON/STAT. FWIW, I looked back at this issue thread and I had previously confirmed your patch. I believe the Allinea folks said they were the ones still seeing problems, per your comment on Sept 5.
Ah, ok - I'll run their test and see what happens.
Okay, I have confirmed the following:
The problem is that the daemon is attempting to send to the rank=0 app proc before that proc is up and running. This causes a SIGPIPE which GDB is trapping, which subsequently causes the daemon to abruptly exit and results in the proc segfaulting. I'm tracking down why the daemon feels a need to communicate, as that is the root cause. Meantime, I am closing this issue as it has nothing to do with the originally reported problem, and will open another issue to track this specific problem.
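For anyone who needs to reproduce or sidestep that behaviour while debugging mpirun under GDB, the usual knob is GDB's signal handling: tell it to pass SIGPIPE through to the inferior without stopping. A sketch, where the file name and mpirun command line are assumptions:

```shell
# Sketch: let SIGPIPE pass straight through to mpirun instead of being
# trapped by GDB, so the daemon's early send does not abort the session.
# The file name and mpirun arguments are assumptions.
cat > sigpipe.gdb <<'EOF'
handle SIGPIPE nostop noprint pass
run
EOF
# To try it (not run here):
#   gdb --batch -x sigpipe.gdb --args mpirun -n 2 ./a.out
```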
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v2.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from source tarball
Please describe the system on which you are running
Details of the problem
I am having trouble attaching LaunchMON when using OpenMPI 2.1.1.
Here's how you can reproduce (modify your PATH and the path to mpirun):
In addition, the LaunchMON "test.attach_1" test hangs when trying to attach. @rhc54 had previously helped me with various debugger attach issues and we had a working commit. I don't know if that made it into the release or if this is a new issue. It would be nice if LaunchMON tests could be integrated as part of the release testing.