Occasional failures in MPI part of the unit tests on ARM neoverse_v1 #334
I followed this issue here from the EESSI repo. I'm trying to reproduce, but I haven't been able to do so. I've tried GCC 13.2.0 with Open MPI 4.1.6 and Open MPI 4.1.5, running on an AWS hpc7g instance (Ubuntu 22.04). After being unable to reproduce directly from FFTW source, I tried the following EasyBuild-based approach:
which is based on trying to reproduce https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082. After the build, I can run the test suite.
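For reference, a from-source attempt along those lines might look like the sketch below; the FFTW version, configure flags, and use of make check are assumptions, not the exact commands used above.

```bash
# Sketch only: assumed version and flags, not the original reproduction commands.
wget https://www.fftw.org/fftw-3.3.10.tar.gz
tar xzf fftw-3.3.10.tar.gz && cd fftw-3.3.10
./configure CC=gcc MPICC=mpicc --enable-mpi --enable-shared
make -j"$(nproc)"
make check   # with --enable-mpi this also runs the MPI tests, driven by mpi-bench under mpi/
```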
One observation: all the failures I've seen reported are from mpi-bench. It is true that mpirun may do slightly different things when it detects that it is running as part of a Slurm job. Can you provide any detail about how the Slurm job is allocated or launched?
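For context, a quick way to see what Slurm state mpirun might be picking up is to inspect the environment of the shell that launches it; this is a generic check, not something from the report above.

```bash
# List Slurm-related variables visible to the launching shell; Open MPI's mpirun
# adjusts how it launches and maps ranks when it detects an active Slurm allocation.
env | grep '^SLURM_' | sort
```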
I'm not sure of the exact job characteristics for the test build reported in https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082. For the builds done in EESSI I also couldn't tell you exactly what resources were requested in the job. But: this is run in a container, and then in a shell in which only a single SLURM-related job variable is set. I did notice that I had fewer failures when I did the building interactively (though still in a job environment; it was an interactive SLURM job), as mentioned here. That seems to confirm that the environment somehow has an effect, but... I couldn't really say what. This is a hard one :(
Hm, I suddenly realize one difference between our bot building for EESSI, and your typical interactive environment: the bot not only builds in the container, it builds in a writeable overlay in the container. That tends to be a bit sluggish in terms of I/O. I'm wondering if that can somehow affect how these tests run. It's a bit far-fetched, and I wouldn't be able to explain the mechanism that makes it fail, but it would explain why my own interactive attempts showed a much higher success rate.
Hm, I wonder how many CPUs were allocated to that container? I saw it was configured to allow oversubscription; I guess there is probably only 1 CPU core, which is different from my testing...
Our build nodes in AWS have 16 cores. Not sure what @casparvl used for testing interactively.
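One way to mimic a small container on a larger node is to oversubscribe deliberately; a sketch, assuming a built mpi-bench in the FFTW build tree and an illustrative problem size (c64x64 is not taken from the failing runs).

```bash
# Run more MPI ranks than cores, as a crude stand-in for a 1-core container.
cd mpi
mpirun --oversubscribe -np 4 ./mpi-bench c64x64
```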
Is there a way for me to get access to that build container so I may try it myself?
Yes, it's part of https://github.com/EESSI/software-layer. Your timing is pretty good: I very recently made a PR to our docs to explain how to use it to replicate build failures. The PR isn't merged yet, but it's markdown, so you can simply view a rendered version in my feature branch. Links won't work in there, but I guess you can find your way around if need be - though I think this one markdown doc should cover it all.
Btw, I've tried to reproduce it once again, since we now have a new build cluster (based on Magic Castle instead of Cluster in the Cloud). I've only tried interactively (basically following the docs I just shared), and I cannot for the life of me replicate our own issue. As mentioned in the original issue, interactively I had much higher success rates (9/10 times, more or less), but I've now run the build and tests
at least 20 times without failures. I'd love to see if the error still occurs when the bot builds it (as there it was consistently failing before), but my initial attempt failed for other reasons (basically, the bot cannot reinstall anything that already exists in the EESSI software stack - if you try, it'll fail on trying to change permissions on a read-only file). I'll check with others if there is something I can do to work around this, so that I can actually trigger a rebuild with the bot.
Yeah, I ran it over 200 times without failure on my cluster. Thank you for the pointers in that doc PR. I'll use that to try and trigger it again.
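For repeated runs, a simple loop that stops at the first failure and keeps its log is enough; this is a generic sketch, not the exact loop used for the 200+ runs mentioned above.

```bash
# Rerun the FFTW test suite until it fails, preserving the log of the failing iteration.
for i in $(seq 1 200); do
    echo "=== iteration $i ==="
    make check > "check-$i.log" 2>&1 || { echo "failed on iteration $i"; break; }
done
```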
@casparvl Should I temporarily revive a node in our old CitC Slurm cluster, to check if the problem was somehow specific to that environment?
@casparvl I haven't had the time to reproduce within a container. Are we still seeing the test failures, or are they not happening on the newer build cluster?
I am still seeing this problem on our build cluster when doing a test installation of it (in an interactive session). A first attempt resulted in a segfault:
A second attempt showed:
I tried to replicate this over the weekend. @casparvl's documentation was extremely helpful, thank you! I tried to debug this PR: https://github.com/EESSI/software-layer/pull/374/files
And then, within the EasyBuild container, I did this in a loop:
It ran 374 times over the weekend without failure on an hpc7g.16xlarge (64 cores). @casparvl, it sounded like you suspected a writable overlay could cause more sluggish I/O. I'm not familiar enough with the EESSI container, but I think by using read-write access I have done that, correct? Do either of you have other ideas for me to change? I suppose I can switch to a c7g.4xlarge...
I was able to compile and successfully run on c7g.4xlarge as well, with no issues there either.
@casparvl Do you have other ideas on how I can try to reproduce? I'm not sure if it matters, but my attempt was on Ubuntu 20.04. My repeated testing was just repeated calls of the same command.
Sorry for failing to come back to you on this. I'll try again myself as well. I just did one install, which indeed was successful. The second time, I ran into the same error as @boegel had the second time around:
Running it a third time, it completed successfully again. The only thing you don't mention explicitly is whether you also followed the steps of activating the prefix environment & EESSI pilot stack, as described on https://www.eessi.io/docs/adding_software/debugging_failed_builds/, and whether you sourced the EasyBuild configuration described there. If you didn't, I guess that means you've built the full software stack from the ground up. If that's the case, and if that works, then I guess the conclusion is that something is fishy with one of the FFTW.MPI dependencies we pick up from the EESSI pilot stack (and which you would have done a fresh build of). That's useful information, because it would show that the combination with the dependencies from EESSI somehow triggers this issue. Also, it'd mean you could actually try those steps as well (i.e. start the prefix environment, start the EESSI pilot stack, source the EasyBuild configuration) and see if you can then reproduce the failure. Just for reference, this is a snippet of my setup:
The result:
Curious to hear if you ran using the EESSI pilot stack for dependencies. Maybe you can also share your EasyBuild output?
I'm also still puzzled by the randomness of this issue. I'd love to better understand why the failure of these tests is random. Is the input randomly generated? Is the algorithm simply non-deterministic (e.g. because of a non-deterministic order in reduction operations, or something of that nature)? I'd love to understand if that 'randomness' could somehow be affected by the environment, as initially I seem to have seen many more failures in a job environment than interactively... But I'm not sure if any of you has such intricate knowledge of what these particular tests do :)
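One way to probe the reduction-order hypothesis, assuming Open MPI's tuned collective component is in use, is to pin the allreduce algorithm so every run reduces in the same order; the parameter names below are standard Open MPI MCA knobs, not something taken from this thread.

```bash
# Fix the allreduce algorithm (here 4 = ring) so the reduction order is deterministic;
# if the sporadic numerical failures disappear, reduction order is a plausible factor.
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allreduce_algorithm=4
make check
```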
Yes, I'm afraid I can't speak for the FFTW developers here; perhaps @matteo-frigo could help answer the question about what these particular tests actually do.
My complete steps are here:
Sadly I didn't save my EasyBuild output; let me re-create it again. I am curious: when you "retry", what exactly do you rerun?
Ok, so you also built on top of the dependencies that were already provided from the EESSI side. Then I really don't see any differences, other than (potentially) things in the environment... Strange!
Like you, I retried with the full command. Also interesting: I've now tried a 4th time, and I get a hanging process, i.e. I see two mpi-bench processes that never finish.
I would love a backtrace of both of those processes!
Great idea... but unfortunately my allocation ended 2 minutes after I noticed the hang :( I'm pretty sure I had process hangs before as well, when I ran into this issue originally. I'll try to run it a couple more times tonight, see if I can trigger it again and get a backtrace...
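For the next hang, the backtraces can be grabbed before the allocation expires with a standard gdb attach, assuming gdb is available inside the build environment.

```bash
# Attach to every stuck mpi-bench rank and dump all thread backtraces to a file per PID.
for pid in $(pgrep -f mpi-bench); do
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "bt-$pid.txt" 2>&1
done
```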
Hm, while trying to reproduce my hang (which I haven't succeeded in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a different CPU microarchitecture than it actually is. Anyway, for now, I'll override the detection myself.
Interesting: now that I correctly use the right dependencies (thanks to that override), I've run it about 10-15 times. Each time, it fails with a numerical error like the one above. Now, finally, I've also managed to reproduce the hang with 2 processes. Here's the backtrace:
Our bot indeed overrides the CPU auto-detection during building, because the auto-detection gets this wrong on these nodes.
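A quick cross-check of the detection is sketched below; archspec is an independent detection tool (not necessarily what the EESSI init script uses), and the override variable is the one EESSI documents for forcing a software subdirectory, included here as an assumption.

```bash
# Compare the detected microarchitecture with what the kernel reports.
archspec cpu                           # prints the detected microarchitecture, e.g. neoverse_v1
lscpu | grep -iE 'vendor|model name'

# Assumed EESSI override to force the neoverse_v1 software subdirectory:
export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1
```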
Seems like we (you) are making progress! I tried to add your override. Here is my eb config:
But I still don't get failures during testing. I do think allreduce has the potential to be non-deterministic; however, I'm unsure if that is what's happening here. I wonder: is there a way for me to continually run the test without rebuilding each time?
It is possible. What you could do is stop the EasyBuild installation after a certain point using the stop option.
This should stop it after the build step (and before the test step). Then, you'd want to dump the build environment to a script.
This will dump a script that recreates the build environment. Then, source that script, go into the build directory, and you can rerun the tests there as often as you like without rebuilding.
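Putting that together, the workflow might look roughly like this; the easyconfig name and build-directory layout are assumptions, and --stop / --dump-env-script are standard EasyBuild options, used here as I understand them.

```bash
eb FFTW.MPI-3.3.10-gompi-2023a.eb --stop=build        # build, but stop before the test step
eb FFTW.MPI-3.3.10-gompi-2023a.eb --dump-env-script   # writes FFTW.MPI-3.3.10-gompi-2023a.env

source FFTW.MPI-3.3.10-gompi-2023a.env                # recreate the build environment
cd <your-easybuild-buildpath>/.../fftw-3.3.10/mpi     # assumed layout: source tree left by --stop
make check                                            # rerun just the MPI tests, repeatedly
```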
I'm absolutely puzzled by why things are different for you than for us. Short of seeing if we could have you test things on our cluster, I don't know what else to try for you to reproduce the failure... :/ If that's something you would be up for, see if you can reach out to @boegel on the EESSI Slack in a DM (join here if you're not yet on that channel); he might be able to arrange it for you. @boegel maybe you could also do the reverse: spin up a regular VM outside of our Magic Castle setup and see if you can reproduce the issue there? If not, it must be related to our cluster setup somehow... Also a heads up: I'm going to be on quite a long leave, so I won't be able to respond for the next month or so. Again, maybe @boegel can follow up if needed :)
Thank you for the testing insight and the Slack invite. Enjoy the break. I'll talk to @boegel on Slack and see what he thinks is a reasonable next step.
@lrbison When would you like to follow up on this?
I talked offline with Kenneth. In the meantime, my pattern-matching neurons fired: both #334 (comment) and https://gitlab.com/eessi/support/-/issues/24#note_1734228961 have something in common: both are in mca_btl_smcuda_component_progress from the smcuda module, but I recall smcuda should really only be engaged when CUDA/ROCm/{accelerator} memory is used; otherwise we should be using the SM BTL. I'll follow up on that. Another similarity: although the FFTW backtrace is just from a sendrecv, the hang was stopped during an allreduce, and both the OpenFOAM and FFTW cases were in ompi_coll_base_allreduce_intra_recursivedoubling. However, my gut tells me it's not the reduction at fault but rather the progress engine (partially because I know for a fact we are testing that allreduce function daily without issue).
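If smcuda really should not be in play, one experiment is to exclude it and rerun; the caret-exclusion syntax is standard Open MPI MCA usage, and trying it here is a suggestion rather than something from the thread.

```bash
# Exclude the smcuda BTL so on-node traffic goes through the regular shared-memory BTL;
# if the hangs and sporadic failures stop, smcuda's progress path is implicated.
export OMPI_MCA_btl='^smcuda'
make check
```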
Moving the rest of this discussion to https://gitlab.com/eessi/support/-/issues/41
The root cause was open-mpi/ompi#12270, fixed in open-mpi/ompi#12338, so this issue can be closed.
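For anyone checking whether their installation is affected, the Open MPI version and the presence of the smcuda component can be inspected with standard ompi_info calls (generic usage, not commands from this thread).

```bash
ompi_info --version          # which Open MPI release is in use
ompi_info | grep -i smcuda   # is the smcuda BTL built into this installation?
```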
For Neoverse V1 users: if you can also try and report on the release-for-testing in #315, it would be useful for getting SVE support upstream.
Closing as requested. |
I've built FFTW on an ARM neoverse_v1 architecture. However, when running the test suite (make check) I get occasional failures. The strange thing is: they don't happen consistently, and they don't always happen in the same test. Two example (partial) outputs I've had:
i.e. an error in the 3-CPU tests. And:
i.e. a failure in the 4-CPU part of the tests.
When run interactively, I seem to get these failures about 1 out of 10 times. I also experience the occasional hang (looks like a deadlock, but I'm not sure).
We also do this build in an automated (continuous deployment) environment, where it is built within a SLURM job. For some reason, there it always seems to fail (or at least the failure rate is high enough that 5 attempts haven't led to a successful run).
My questions here: