Hybrid OpenMP/MPI #1627

Closed
stevengj opened this issue Jun 23, 2021 · 12 comments · Fixed by #1628
Comments

@stevengj
Collaborator

It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.
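
For reference, a minimal hybrid MPI+OpenMP skeleton (purely illustrative, not Meep code) looks roughly like the following: one MPI rank per node holds the per-node data, and OpenMP threads split the loop work within that node. It would be launched with one rank per node and OMP_NUM_THREADS set to the number of cores per node.

    // hybrid skeleton -- illustrative only, not Meep code
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
      int provided;
      // FUNNELED suffices if only the master thread makes MPI calls
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // one copy of the (large) per-node data, shared by all threads on the node
      std::vector<double> grid(1000000, 0.0);

      // OpenMP threads split this rank's loop iterations
    #pragma omp parallel for
      for (long i = 0; i < (long)grid.size(); ++i)
        grid[i] += 1.0;

      if (rank == 0)
        std::printf("up to %d OpenMP threads per rank\n", omp_get_max_threads());

      MPI_Finalize();
      return 0;
    }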

One big motivation is to save memory for the case of 3d design grids. We replicate the material grid (as well as several similar-size data structures for gradients and other vectors internal to NLopt) for every MPI process — this would allow us to have only one copy per node of a cluster.

In particular, @mochen4 is now doing 3d topology optimization of 3d-printed structures, and we are starting to run out of memory on machines with 40 CPUs per node, since the 3d material grid (which can be gigabytes in size) gets replicated 40 times.

@smartalecH, did you have a branch that could be rebased here?

@smartalecH
Collaborator

It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.

I actually implemented hybrid openMP/MPI support for a class project last year. My branch is here.

In short, I updated all of the LOOP macros to support OpenMP. I had to make several other changes to prevent issues like false sharing. Unfortunately, the performance was almost always worse with OpenMP. Here is a report describing everything I did, along with several numerical experiments, although I'm sure a more careful approach would yield better results.
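
A rough sketch of the kind of change involved (hypothetical names, not Meep's actual LOOP macros): the loop body gets an OpenMP worksharing pragma, and any per-thread accumulators are padded out to separate cache lines so that neighboring threads don't repeatedly invalidate each other's lines (false sharing).

    // hypothetical sketch, not the actual Meep macros
    #include <omp.h>

    struct alignas(64) padded_double { // one cache line per thread
      double val = 0.0;
    };

    // caller allocates one padded_double per thread (omp_get_max_threads())
    void update_fields(double *f, const double *g, long n, double c,
                       padded_double *thread_sums) {
    #pragma omp parallel
      {
        const int t = omp_get_thread_num();
    #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i) {
          f[i] += c * g[i];
          thread_sums[t].val += f[i]; // each thread writes only its own cache line
        }
      }
    }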

this would allow us to have only one copy per node of a cluster.

I think we discussed implementing this feature when we add the backprop step to the new subpixel-averaging feature. We need to revamp quite a few things for this to work (e.g. storing the forward-run fields efficiently even after we reset Meep to do an adjoint run). Allocating the initial material grid on just one node in Python is a bit hairy too (note that cluster job managers typically allocate memory per process -- it's somewhat tricky to let per-process memory allocations spill over into each other even within the same node). Would the optimizer just run on one proc? Either way, memory management gets tricky with that approach too.

But I agree, it's an important feature.

@stevengj
Collaborator Author

The point is that if we use OpenMP to multi-thread within the node, and only use MPI across nodes, then it will automatically allocate the material grid once per node and have one NLopt instance per node — from Meep's perspective, it is really just a single process except that some of the loops are faster.

I think this is worth having even if it is currently somewhat slower than pure MPI, because it allows vastly better memory efficiency (for large material grids) with minimal effort.

(An alternative would be to use explicit shared-memory allocation, e.g. with MPI.Win.Create() in mpi4py, plus a whole bunch of additional synchronization to ensure that we only use one NLopt instance per node. (NLopt also allocates data proportional to the number of DOF!) Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks if they are not doing the same things on all processes.)
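
For concreteness, the comment above mentions mpi4py's MPI.Win.Create(); the sketch below uses MPI_Win_allocate_shared, a related MPI-3 shared-memory call at the C++ level, to show roughly what the "explicit shared memory" alternative would look like. Illustrative only:

    // sketch of the explicit shared-memory alternative using MPI-3 windows
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      // group the ranks that share a node
      MPI_Comm node_comm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &node_comm);
      int node_rank;
      MPI_Comm_rank(node_comm, &node_rank);

      // rank 0 on each node allocates the (large) array once;
      // the other ranks on the node attach to the same physical memory
      const MPI_Aint n = 1000000;
      MPI_Aint local_size = (node_rank == 0) ? n * (MPI_Aint)sizeof(double) : 0;
      double *grid;
      MPI_Win win;
      MPI_Win_allocate_shared(local_size, sizeof(double), MPI_INFO_NULL,
                              node_comm, &grid, &win);
      if (node_rank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &grid);
      }

      // rank 0 writes; every rank on the node reads the same memory after a fence
      MPI_Win_fence(0, win);
      if (node_rank == 0) grid[0] = 42.0;
      MPI_Win_fence(0, win);
      std::printf("node rank %d sees grid[0] = %g\n", node_rank, grid[0]);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
    }

As noted above, this is exactly where the extra synchronization and the one-NLopt-instance-per-node bookkeeping would have to live, which is what makes this alternative unattractive.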

@smartalecH
Collaborator

then it will automatically allocate the material grid once per node and have one NLopt instance per node

Got it. My branch should be able to do exactly what you are describing.

even if it is currently somewhat slower than pure MPI

I'd say it's worse than "somewhat slower". Note that for some test examples (on a single node with 40 cores) MPI achieved a 20x speedup, while OpenMP only achieved a 5x speedup. Plus, the initialization process was significantly slower with OpenMP (even though I wrapped all of those loops too).

That being said, maybe @mochen4 can try using my branch as an initial investigation. It only needs minor rebasing.

We can then (hopefully) identify the bottlenecks and resolve them after some experimentation. @oskooi it might be worth pulling in some extra resources toward investigating this, especially since I already have a rather mature starting point and we only need to fine-tune at this point.

Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks

I agree, this sounds painful...

@mochen4
Collaborator

mochen4 commented Jul 15, 2021

On MIT Supercloud, configured with both --with-openmp and --with-mpi, the following make check Python tests failed (setting RUNCODE appropriately in each case, see #1671):

1 process and 1 thread:
adjoint_jax, mode_coeffs, mode_decomposition, source: segfault
simulation: under --with-mpi, it requires more than 1 process:

@unittest.skipIf(not mp.with_mpi(), "MPI specific test")
def test_mpi(self):
    self.assertGreater(mp.comm.Get_size(), 1)

2 processes and 1 thread:
Same segfault tests as above.

1 process and 2 threads:
adjoint_jax, get_point, mode_decomposition, multilevel_atom, source, user_defined_material: segfault
adjoint_solver, bend_flux, material_grid, simulation: AssertionError (i.e. results mismatch)
mode_coeffs: Singular matrix in inverse

2 processes and 2 threads:
Same as above

With the master branch, adjoint_jax, mode_coeffs, mode_decomposition, and source also failed with SegFault.
All C++ tests passed in all cases.

@stevengj
Collaborator Author

If you add --enable-debug to the configure flags, can you get a stack trace from the segfault to find out where the crash is? See https://www.open-mpi.org/faq/?category=debugging#general-parallel-debugging

@stevengj
Collaborator Author

It might be possible to automatically print a stacktrace on a segfault: https://stackoverflow.com/questions/77005/how-to-automatically-generate-a-stacktrace-when-my-program-crashes
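
One of the approaches from that thread, sketched here (glibc-specific; the symbols are most useful when the build has debug info, e.g. --enable-debug): install a SIGSEGV handler that dumps the call stack with backtrace_symbols_fd, which does not call malloc and so is reasonably safe inside a signal handler.

    // sketch: print a glibc backtrace when a segfault is caught
    #include <execinfo.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstdio>

    static void segv_handler(int sig) {
      void *frames[64];
      int n = backtrace(frames, 64);
      std::fprintf(stderr, "caught signal %d, backtrace:\n", sig);
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      _exit(1);
    }

    int main() {
      std::signal(SIGSEGV, segv_handler); // install early, before any crash
      // ... rest of the program ...
      return 0;
    }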

@stevengj
Collaborator Author

See if you can reproduce the problem on your local machine rather than supercloud.

@yugeniom

Hello there! Thank you so much for all the effort you have put into Meep; it has let me speed things up a lot on the HPC cluster I use, compared with the license-count limitations of Lumerical.
I am presently working on topology optimizations with 3d grids using pymeep, and I read about a possible hybrid MPI/OpenMP implementation in Meep. OpenMP support in addition to MPI would help a great deal with the 3d problems I am treating now, since I am typically constrained to use a quarter of the available cores in each node before hitting performance degradation and memory exhaustion. Is such an implementation already available in the nightly build of pymeep? I would be glad to test it and report issues.

@smartalecH
Collaborator

Is such an implementation already available in the nightly build of pymeep

No, but you can build off of the branch in #1628 to test the preliminary features there. While you'll save on memory, there's still quite a bit of improvement needed to get the OpenMP computational scalability up to where it is currently with pure MPI.

@yugeniom

Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included? If possible I'll try to compile it, but the demanding situation I need to test it for - large-area 3d optimizations - requires me to use an HPC cluster where I don't have root access, which is why I asked about pymeep...
BTW, from some job-scheduling tests, my large-area 3d optimizations look badly limited by local memory bandwidth within each node, such that I obtain the best runtime performance by limiting myself to 10-20 processes on a 64-core (HT), 2-socket node, even with local memory binding and process-to-core pinning; do you think your OpenMP implementation could help with that, if used in hybrid fashion with MPI?
In any case, thank you for the fast answer!

@smartalecH
Collaborator

Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included

Yes, that's the core contribution of the PR. Other forms of parallelization (e.g. simulation initialization) are not yet implemented due to thread-safety issues.

I don't have root access

You don't need root access to compile meep. I've actually compiled meep on over a dozen HPC clusters myself (without root access), all for large-scale, 3D TO.

limited by local memory bandwidth

The FDTD algorithm is inherently memory bound, so there's only so much you can do here (unless you want to implement some fancy updating schemes that can cache multiple steps).

I obtain the best runtime performance by limiting myself to 10-20 processes on a 64-core (HT), 2-socket node

Choosing the proper hardware configuration is a longstanding issue, as you must carefully balance your computation and communication loads, which ultimately requires some extensive trial-and-error on the platform you are using.

do you think your OpenMP implementation could help with that

Maybe. Not with the current branch, though. As I said, a lot of effort is still needed to reach even the same level of performance as the current MPI build. The biggest advantage of the hybrid approach right now is reducing the number of duplicate Python instances, each with its own set of variables (e.g. the 3D design variables).
