Hybrid OpenMP/MPI #1627

Closed
stevengj opened this issue Jun 23, 2021 · 12 comments · Fixed by #1628
Comments

@stevengj
Collaborator

It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.
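
For reference, a minimal hybrid MPI+OpenMP skeleton (purely illustrative, not Meep code) looks roughly like the following: one MPI rank per node holds the per-node data, and OpenMP threads split the loop work within that node. It would be launched with one rank per node and OMP_NUM_THREADS set to the number of cores per node.

    // hybrid skeleton -- illustrative only, not Meep code
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
      int provided;
      // FUNNELED suffices if only the master thread makes MPI calls
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // one copy of the (large) per-node data, shared by all threads on the node
      std::vector<double> grid(1000000, 0.0);

      // OpenMP threads split this rank's loop iterations
    #pragma omp parallel for
      for (long i = 0; i < (long)grid.size(); ++i)
        grid[i] += 1.0;

      if (rank == 0)
        std::printf("up to %d OpenMP threads per rank\n", omp_get_max_threads());

      MPI_Finalize();
      return 0;
    }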

One big motivation is to save memory for the case of 3d design grids. We replicate the material grid (as well as several similar-size data structures for gradients and other vectors internal to NLopt) for every MPI process — this would allow us to have only one copy per node of a cluster.

In particular, @mochen4 is now doing 3d topology optimization of 3d-printed structures, and we are starting to run out of memory on machines with 40 CPUs per node, since the 3d material grid (which can be gigabytes in size) gets replicated 40 times.

@smartalecH, did you have a branch that could be rebased here?

@smartalecH
Collaborator

It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.

I actually implemented hybrid openMP/MPI support for a class project last year. My branch is here.

In short, I updated all of the LOOP macros to support OpenMP. I had to make several other changes to prevent issues like false sharing. Unfortunately, the performance was almost always worse with OpenMP. Here is a report describing everything I did, along with several numerical experiments, although I'm sure a more careful approach would yield better results.
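
A rough sketch of the kind of change involved (hypothetical names, not Meep's actual LOOP macros): the loop body gets an OpenMP worksharing pragma, and any per-thread accumulators are padded out to separate cache lines so that neighboring threads don't repeatedly invalidate each other's lines (false sharing).

    // hypothetical sketch, not the actual Meep macros
    #include <omp.h>

    struct alignas(64) padded_double { // one cache line per thread
      double val = 0.0;
    };

    // caller allocates one padded_double per thread (omp_get_max_threads())
    void update_fields(double *f, const double *g, long n, double c,
                       padded_double *thread_sums) {
    #pragma omp parallel
      {
        const int t = omp_get_thread_num();
    #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i) {
          f[i] += c * g[i];
          thread_sums[t].val += f[i]; // each thread writes only its own cache line
        }
      }
    }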

this would allow us to have only one copy per node of a cluster.

I think we discussed implementing this feature when we add the backprop step to the new subpixel-averaging feature. We need to revamp quite a few things for this to work (e.g. storing the forward-run fields efficiently even after we reset Meep to do an adjoint run). Allocating the initial material grid on just one node in Python is a bit hairy too (note that cluster job managers typically allocate memory per process -- it's somewhat tricky to let per-process memory allocations spill over into each other even within the same node). Would the optimizer just run on one proc? Either way, memory management gets tricky with that approach too.

But I agree, it's an important feature.

@stevengj
Collaborator Author

The point is that if we use OpenMP to multi-thread within the node, and only use MPI across nodes, then it will automatically allocate the material grid once per node and have one NLopt instance per node — from Meep's perspective, it is really just a single process except that some of the loops are faster.

I think this is worth having even if it is currently somewhat slower than pure MPI, because it allows vastly better memory efficiency (for large material grids) with minimal effort.

(An alternative would be to use explicit shared-memory allocation, e.g. with MPI.Win.Create() in mpi4py, plus a whole bunch of additional synchronization to ensure that we only use one NLopt instance per node. (NLopt also allocates data proportional to the number of DOF!) Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks if they are not doing the same things on all processes.)
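
For concreteness, the comment above mentions mpi4py's MPI.Win.Create(); the sketch below uses MPI_Win_allocate_shared, a related MPI-3 shared-memory call at the C++ level, to show roughly what the "explicit shared memory" alternative would look like. Illustrative only:

    // sketch of the explicit shared-memory alternative using MPI-3 windows
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      // group the ranks that share a node
      MPI_Comm node_comm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &node_comm);
      int node_rank;
      MPI_Comm_rank(node_comm, &node_rank);

      // rank 0 on each node allocates the (large) array once;
      // the other ranks on the node attach to the same physical memory
      const MPI_Aint n = 1000000;
      MPI_Aint local_size = (node_rank == 0) ? n * (MPI_Aint)sizeof(double) : 0;
      double *grid;
      MPI_Win win;
      MPI_Win_allocate_shared(local_size, sizeof(double), MPI_INFO_NULL,
                              node_comm, &grid, &win);
      if (node_rank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &grid);
      }

      // rank 0 writes; every rank on the node reads the same memory after a fence
      MPI_Win_fence(0, win);
      if (node_rank == 0) grid[0] = 42.0;
      MPI_Win_fence(0, win);
      std::printf("node rank %d sees grid[0] = %g\n", node_rank, grid[0]);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
    }

As noted above, this is exactly where the extra synchronization and the one-NLopt-instance-per-node bookkeeping would have to live, which is what makes this alternative unattractive.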

@smartalecH
Collaborator

then it will automatically allocate the material grid once per node and have one NLopt instance per node

Got it. My branch should be able to do exactly what you are describing.

even if it is currently somewhat slower than pure MPI

I'd say it's worse than "somewhat slower". Note that for some test examples (on a single node with 40 cores) MPI achieved a 20x speedup, while OpenMP only achieved a 5x speedup. Plus, the initialization process was significantly slower with OpenMP (even though I wrapped all of those loops too).

That being said, maybe @mochen4 can try using my branch as an initial investigation. It only needs minor rebasing.

We can then (hopefully) identify the bottlenecks and resolve them after some experimentation. @oskooi it might be worth pulling in some extra resources toward investigating this, especially since I already have a rather mature starting point and we only need to fine-tune at this point.

Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks

I agree, this sounds painful...

@mochen4
Collaborator

mochen4 commented Jul 15, 2021

On MIT Supercloud, configured with both --with-openmp and --with-mpi, the following make check Python tests failed (setting RUNCODE appropriately in each case, see #1671):

1 process and 1 thread:
adjoint_jax, mode_coeffs, mode_decomposition, source: segfault
simulation: under --with-mpi, it requires more than 1 process:

@unittest.skipIf(not mp.with_mpi(), "MPI specific test")
def test_mpi(self):
    self.assertGreater(mp.comm.Get_size(), 1)

2 processes and 1 thread:
Same segfault tests as above.

1 process and 2 threads:
adjoint_jax, get_point, mode_decomposition, multilevel_atom, source, user_defined_material: segfault
adjoint_solver, bend_flux, material_grid, simulation: AssertionError (i.e. results mismatch)
mode_coeffs: Singular matrix in inverse

2 processes and 2 threads:
Same as above

With the master branch, adjoint_jax, mode_coeffs, mode_decomposition, and source also failed with SegFault.
All C++ tests passed in all cases.

@stevengj
Collaborator Author

If you add --enable-debug to the configure flags, can you get a stack trace from the segfault to find out where the crash is? See https://www.open-mpi.org/faq/?category=debugging#general-parallel-debugging

@stevengj
Collaborator Author

It might be possible to automatically print a stacktrace on a segfault: https://stackoverflow.com/questions/77005/how-to-automatically-generate-a-stacktrace-when-my-program-crashes
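
One of the approaches from that thread, sketched here (glibc-specific; the symbols are most useful when the build has debug info, e.g. --enable-debug): install a SIGSEGV handler that dumps the call stack with backtrace_symbols_fd, which does not call malloc and so is reasonably safe inside a signal handler.

    // sketch: print a glibc backtrace when a segfault is caught
    #include <execinfo.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstdio>

    static void segv_handler(int sig) {
      void *frames[64];
      int n = backtrace(frames, 64);
      std::fprintf(stderr, "caught signal %d, backtrace:\n", sig);
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      _exit(1);
    }

    int main() {
      std::signal(SIGSEGV, segv_handler); // install early, before any crash
      // ... rest of the program ...
      return 0;
    }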

@stevengj
Collaborator Author

See if you can reproduce the problem on your local machine rather than supercloud.

@yugeniom

Hello there! Thank you so much for all the effort you have put into Meep; it has let me speed things up a lot on the HPC cluster I use, compared with the license-count limitations of Lumerical.
I am presently working on topology optimizations with 3d grids using pymeep, and I read about a possible hybrid MPI/OpenMP implementation in Meep. OpenMP support in addition to MPI would help a great deal with the 3d problems I am treating now, since I am typically constrained to use a quarter of the available cores in each node before hitting performance degradation and memory exhaustion. Is such an implementation already available in the nightly build of pymeep? I would be glad to test it and report issues.

@smartalecH
Collaborator

Is such an implementation already available in the nightly build of pymeep

No, but you can build off of the branch in #1628 to test the preliminary features there. While you'll save on memory, there's still quite a bit of improvement needed to get the OpenMP computational scalability up to where it is currently with pure MPI.

@yugeniom

Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included? If possible I'll try to compile it, but the demanding situation I need to test it for - large-area 3d optimizations - requires me to use an HPC cluster where I don't have root access, which is why I asked about pymeep...
BTW, from some job-scheduling tests, my large-area 3d optimizations look badly limited by local memory bandwidth within each node, such that I obtain the best runtime performance by limiting myself to 10-20 processes on a 64-core (HT), 2-socket node, even with local memory binding and process-to-core pinning; do you think your OpenMP implementation could help with that, if used in hybrid fashion with MPI?
In any case, thank you for the fast answer!

@smartalecH
Collaborator

Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included

Yes, that's the core contribution of the PR. Other forms of parallelization (e.g. simulation initialization) are not yet implemented due to thread-safety issues.

I don't have root access

You don't need root access to compile meep. I've actually compiled meep on over a dozen HPC clusters myself (without root access), all for large-scale, 3D TO.

limited by local memory bandwidth

The FDTD algorithm is inherently memory bound, so there's only so much you can do here (unless you want to implement some fancy updating schemes that can cache multiple steps).

I obtain the best runtime performance by limiting myself to 10-20 processes on a 64-core (HT), 2-socket node

Choosing the proper hardware configuration is a longstanding issue, as you must carefully balance your computation and communication loads, which ultimately requires some extensive trial-and-error on the platform you are using.

do you think your OpenMP implementation could help with that

Maybe. Not with the current branch, though. As I said, a lot of effort is still needed to reach even the same level of performance as the current MPI build. The biggest advantage of the hybrid approach right now is reducing the number of duplicate Python instances, each with its own set of variables (e.g. the 3D design variables).
