Hybrid OpenMP/MPI #1627
I actually implemented hybrid OpenMP/MPI support for a class project last year. My branch is here. In short, I updated all the LOOP macros to support OpenMP, and I had to make several other changes to prevent issues like false sharing. Unfortunately, the performance was almost always worse with OpenMP. Here is a report describing everything I did, along with several numerical experiments, although I'm sure a more careful approach would yield better results.
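For context, the change mostly amounts to putting OpenMP worksharing pragmas on the time-stepping loops. A minimal sketch of the idea (not Meep's actual LOOP macros, just an illustrative 1d field update):

```cpp
#include <cstddef>

// Illustrative 1d E-field update; not Meep's real stepping code.
void step_e(double *ex, const double *hy, double c, std::size_t n) {
  // schedule(static) gives each thread one contiguous block of the array,
  // so different threads rarely write to the same cache line (the
  // false-sharing problem mentioned above).
  #pragma omp parallel for schedule(static)
  for (std::size_t i = 1; i < n; ++i)
    ex[i] += c * (hy[i] - hy[i - 1]);
}
```

Compile with `-fopenmp`; the loop body itself is unchanged, the work is just split across threads.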
I think we discussed implementing this feature when we add the backprop step to the new subpixel-averaging feature. We need to revamp quite a few things for this to work (e.g. storing the forward-run fields efficiently even after we reset Meep to do an adjoint run). Allocating the initial material grid on just one node in Python is a bit hairy too (note that cluster job managers typically allocate memory per process; it's somewhat tricky to let per-process memory allocations spill over into each other even within the same node). Would the optimizer just run on one process? Either way, memory management gets tricky with that approach too. But I agree, it's an important feature.
The point is that if we use OpenMP to multi-thread within the node, and only use MPI across nodes, then it will automatically allocate the material grid once per node and have one NLopt instance per node; from Meep's perspective, it is really just a single process except that some of the loops are faster. I think this is worth having even if it is currently somewhat slower than pure MPI, because it allows vastly better memory efficiency (for large material grids) with minimal effort. (An alternative would be to use explicit shared-memory allocation, e.g. with …)
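For reference, one explicit shared-memory route is the MPI-3 shared-memory window interface; a rough sketch (the function and variable names here are made up for illustration, but `MPI_Comm_split_type` and `MPI_Win_allocate_shared` are standard MPI-3 calls):

```cpp
#include <mpi.h>
#include <cstddef>

// Sketch: allocate one copy of the design grid per node and let every rank
// on that node access it through a shared-memory window.
double *allocate_node_shared_grid(MPI_Comm world, std::size_t n, MPI_Win *win) {
  // Group together the ranks that can share memory, i.e. the ranks on one node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  // Rank 0 on each node provides the whole allocation; the others ask for 0 bytes.
  MPI_Aint bytes = (node_rank == 0) ? (MPI_Aint)(n * sizeof(double)) : 0;
  double *base = nullptr;
  MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node_comm, &base, win);

  // Every rank queries rank 0's segment to get a pointer to the shared copy.
  MPI_Aint qsize;
  int qdisp;
  double *grid = nullptr;
  MPI_Win_shared_query(*win, 0, &qsize, &qdisp, &grid);

  // A real implementation would keep node_comm around and free it together
  // with the window at shutdown.
  return grid;
}
```

Each node then holds exactly one physical copy of the grid no matter how many ranks run on it, at the cost of managing the window's lifetime explicitly.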
Got it. My branch should be able to do exactly what you are describing.
I'd say it's worse than "somewhat slower". Note that for some test examples (on a single node with 40 cores) MPI achieved a 20x speedup, while OpenMP only achieved a 5x speedup. Plus, the initialization process was significantly slower with OpenMP (even though I wrapped all of those loops too). That being said, maybe @mochen4 can try using my branch as an initial investigation; it only needs minor rebasing. We can then (hopefully) identify the bottlenecks and resolve them after some experimentation. @oskooi it might be worth pulling in some extra resources toward investigating this, especially since I already have a rather mature starting point and we only need to fine-tune at this point.
I agree, this sounds painful...
On MIT Supercloud, configured with both "with-openmp" and "with-mpi", I ran the snippet at meep/python/tests/test_simulation.py, lines 126 to 128 (commit 89c9349), under the following configurations: 1 process and 1 thread, 2 processes and 1 thread, 1 process and 2 threads, and 2 processes and 2 threads.
With the master branch, adjoint_jax, mode_coeffs, mode_decomposition, and source also failed with a SegFault.
If you add
It might be possible to automatically print a stack trace on a segfault: https://stackoverflow.com/questions/77005/how-to-automatically-generate-a-stacktrace-when-my-program-crashes
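For example, a minimal (glibc-specific) sketch along those lines, installing a SIGSEGV handler that dumps the raw backtrace to stderr:

```cpp
#include <csignal>
#include <execinfo.h>  // glibc backtrace support
#include <unistd.h>

// Print a raw stack trace when a segfault is caught, then exit.
// Build with -g (and ideally -rdynamic) so the frames carry symbol names.
static void segfault_handler(int sig) {
  (void)sig;
  void *frames[64];
  int n = backtrace(frames, 64);
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
  _exit(1);
}

int main() {
  std::signal(SIGSEGV, segfault_handler);
  // ... run the failing test as usual ...
}
```

The printed addresses can then be mapped back to source lines with `addr2line` if the symbol names alone aren't enough.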
See if you can reproduce the problem on your local machine rather than Supercloud.
Try the suggestion here: https://github.com/NanoComp/meep/pull/1628/files#r670728520
Hello there! Let me thank you so much for all the effort you have put into Meep; it has enabled me to speed things up a lot on the HPC cluster I use, compared with Lumerical and its license-count limitations.
No, but you can build off of the branch in #1628 to test the preliminary features there. While you'll save on memory, there's still quite a bit of improvement needed to get the OpenMP computational scalability up to where it is currently with pure MPI.
Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included? If I get the chance I'll try to compile it, but the demanding case I need it for (large-area 3d optimizations) requires an HPC cluster where I don't have root access, which is why I asked about pymeep...
Yes, that's the core contribution of the PR, actually. Other forms of parallelization (e.g. simulation initialization) are not yet implemented due to thread-safety issues.
You don't need root access to compile meep. I've actually compiled meep on over a dozen HPC clusters myself (without root access), all for large-scale, 3D TO.
The FDTD algorithm is inherently memory bound, so there's only so much you can do here (unless you want to implement some fancy updating schemes that can cache multiple steps).
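For a rough sense of why: a single field update touches several double-precision values per grid point (the field component being updated, its neighboring field components, and material coefficients) while performing only a handful of floating-point operations, so the arithmetic intensity sits well below 1 flop per byte of memory traffic, whereas current CPUs need several flops per byte to be compute-limited rather than bandwidth-limited. (Back-of-the-envelope reasoning, not a measurement.)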
Choosing the proper hardware configuration is a longstanding issue, as you must carefully balance your computation and communication loads, which ultimately requires some extensive trial-and-error on the platform you are using.
Maybe. Not with the current branch, though. As I said, a lot of effort is still needed to bring performance even up to the level of the current MPI build. The biggest advantage of the hybrid approach right now is reducing the number of duplicate Python instances, each with its own set of variables (e.g. 3d design variables).
It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.
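As a point of reference (not Meep code, just the generic hybrid pattern), each node would run a single MPI rank initialized with thread support, and OpenMP would fan out within the node:

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  // FUNNELED: OpenMP threads may exist, but only the main thread makes MPI calls.
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // One rank per node; the thread count per rank is typically set via
  // OMP_NUM_THREADS (or the job scheduler) at submission time.
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("rank %d running %d threads\n", rank, omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}
```

Launched with one rank per node and the scheduler pinning the threads, all inter-node communication stays in MPI while the loop-level work is shared across each node's cores.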
One big motivation is to save memory for the case of 3d design grids. We replicate the material grid (as well as several similar-size data structures for gradients and other vectors internal to NLopt) for every MPI process — this would allow us to have only one copy per node of a cluster.
In particular, @mochen4 is now doing 3d topology optimization for 3d-printed structures, and we are finding that we are starting to run out of memory on machines with 40 CPUs per node, since the 3d material grid (which can be gigabytes in size) gets replicated 40 times.
@smartalecH, did you have a branch that could be rebased here?