opal_lifo test fails on s390x #10988
Comments
@opoplawski I cannot find the error in the build log. Could you please provide some more info on your environment? What compiler are you using? |
Ah, shoot. Linked the wrong build. This is Fedora Rawhide - gcc 12.2.1 https://kojipkgs.fedoraproject.org//work/tasks/4918/93474918/build.log |
Thanks! Any chance you can access |
That's the output I pasted in the first comment. |
I see. The tests run successfully on Summit with GCC 12.1.0 (the latest GCC available on that machine) but I'm not even sure that's the same architecture. I don't have access to any other IBM machine. |
Can you build in a Fedora Rawhide mock environment on that machine? Looks like I have access to some kind of s390x test machine if there are any particular things you'd like me to try.
|
Any chance you could try a different compiler like LLVM? Also, we recently switched from C11 atomics being the default to GCC builtin atomics. Could you try running with C11 atomics instead by passing Summit has Power9 CPUs, so different architecture. |
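For readers unfamiliar with the two atomics flavours mentioned above, here is a minimal sketch of the same lock-free LIFO push written once with C11 `<stdatomic.h>` and once with the GCC/Clang `__atomic` builtins. This is not Open MPI's `opal_lifo` code: the `node_t`, `lifo_c11_t`, and `lifo_builtin_t` types and the function names are hypothetical and exist only for illustration. Pop is deliberately left out, since a correct concurrent pop also has to deal with the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not an Open MPI type. */
typedef struct node {
    struct node *next;
    int value;
} node_t;

/* Variant A: C11 atomics from <stdatomic.h>. */
typedef struct {
    _Atomic(node_t *) head;
} lifo_c11_t;

void lifo_c11_push(lifo_c11_t *lifo, node_t *item)
{
    assert(item != NULL);
    node_t *old = atomic_load_explicit(&lifo->head, memory_order_relaxed);
    do {
        /* Link the new item in front of the head we last observed. */
        item->next = old;
        /* On failure, 'old' is refreshed with the current head and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(&lifo->head, &old, item,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Variant B: GCC/Clang __atomic builtins on a plain pointer. */
typedef struct {
    node_t *head;
} lifo_builtin_t;

void lifo_builtin_push(lifo_builtin_t *lifo, node_t *item)
{
    assert(item != NULL);
    node_t *old = __atomic_load_n(&lifo->head, __ATOMIC_RELAXED);
    do {
        item->next = old;
        /* Fourth argument 'true' requests the weak form of compare-exchange. */
    } while (!__atomic_compare_exchange_n(&lifo->head, &old, item, true,
                                          __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}
```

Both variants express the same compare-and-swap retry loop. How the compiler lowers each one on a given architecture can differ, which is why trying the other flavour is a useful data point.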
So, some data points:
|
I'll also note that the test succeeds in 4.1.4 - but maybe the test has changed in 5.0. |
I've managed to reproduce it on our test machine in case there are any other local tests you would like me to run. |
Still present in 5.0.0rc10 |
Still present in 5.0.0 final |
@opoplawski Realistically, I don't know who is going to fix this. I don't know if anyone has ever run an MPI job on an IBM s390 mainframe. I don't think anyone in the known community has the resources to fix this. |
I tried to reproduce this on current |
@opoplawski Is this still an issue in Fedora rawhide? I tried again with an s390x rawhide emulated docker container and couldn't reproduce this error. |
Well, 5.0.1 still failed - see the s390x build.log from https://koji.fedoraproject.org/koji/buildinfo?buildID=2336590
I can't manage to build the rpm from a github tarball. autogen fails with:
|
I don't believe OMPI supports that approach, if you are talking about the GitHub tarballs they attach to the repo tags. I've had a rare request/discussion about that and believe it traces to the use of submodules, which leaves some dangling connections in the GitHub tarball (since it is literally just created using That said, this specific error is one I encountered elsewhere and resolved by executing |
Since we already build with external libs, I'm getting around the submodule issue with:
Thanks for the suggestion, but the |
I got myself a free instance on the IBM community cloud but still no luck reproducing this. Since there are only two cores available and this seems to be some multi-threading/atomic issue, I might not be able to trigger it there.
Interestingly, when building with clang 17 on that system
To summarize what I observed:
I am running out of time to spend on this, unfortunately. And finding proper docs on this architecture is tedious. Either someone who cares about s390x (anyone from IBM?) will pick it up or OMPI stays broken on that arch. Sorry. |
Sadly, that won't take care of it - the problem is that there is another submodule attached to the You should check to see if the GitHub tarball populates that directory. Pretty sure it doesn't, and that is why you are hitting all those errors. |
I re-discovered the "nightly" tarballs - https://www.open-mpi.org/nightly/main/ and that works for me. FWIW - build on s390x with failure: https://kojipkgs.fedoraproject.org//work/tasks/1335/111371335/build.log |
@joseemoreira Can you help here? |
Hello. Sorry for my delay in responding. I was not aware of this issue until a colleague from IBM just pointed it out to me. I have to find the right person in our System z development team to address this. Will get back to you all soon. PS: Do I need to do something so that issues like this show up in my Dashboard? |
I think an @-mention will just send you an email (depending on what your github notification settings for this org are). I just assigned the issue to you, so perhaps it will show up in your dashboard now...? |
Looking at updating the Fedora openmpi package to 5.0.0rc9. I'm getting the following test failure on s390x:
Full log here (for a few days at least): https://kojipkgs.fedoraproject.org//work/tasks/3745/93473745/build.log