Intel libraries #1750

Closed
yugeniom opened this issue Aug 31, 2021 · 9 comments

Comments

@yugeniom

Hello there,
So far I have done some timing tests of my Meep topology optimization code, using just 4 nodes with the MPICH implementation of MPI included in the Conda package. My projected time to reach enough iterations is well beyond the maximum walltime-per-job allowed on my machine. I have read that the Intel MPI implementation can offer a good speedup over many nodes; has anyone ever tried compiling pymeep from source against it? If so, what is the newest Meep version you have gotten to work with it?

@oskooi
Collaborator

oskooi commented Aug 31, 2021

For a significant speedup in the timestepping rate when using the adjoint solver for inverse design and topology optimization, you can try the three new features added in Meep 1.20: (1) single-precision floating point for the fields arrays, (2) decimation of the DFT field updates, and (3) memory locality for the step-curl updates via loop tiling. Since the Meep Conda package for the 1.20 release is built using double-precision floating point, to use (1) you will need to compile Meep from source using the --enable-single flag. Benchmarking tests on AWS EC2 have demonstrated that these combined features can provide a speedup of more than 5X in the timestepping rate for large 3d problems.

Separately, it would still be useful to investigate the performance impact of the various MPI libraries for single and multi-node MPI clusters.

@smartalecH
Collaborator

I have read that the Intel MPI implementation can offer a good speedup over many nodes

I spent significant time doing this and didn't see any noticeable gain in performance.

My projected time to reach enough iterations is well beyond the maximum walltime-per-job allowed on my machine

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

(2) decimation of the DFT field update

This isn't quite ready for the adjoint simulation.

@yugeniom
Author

yugeniom commented Aug 31, 2021

For a significant speedup in the timestepping rate when using the adjoint solver for inverse design and topology optimization, you can try the three new features added in Meep 1.20: (1) single-precision floating point for the fields arrays, (2) decimation of the DFT field updates, and (3) memory locality for the step-curl updates via loop tiling. […]

Hi,
Thank you for the suggestions. I'll definitely try to compile Meep with Intel MPI as soon as I get access to a machine licensed for it, and I'll let you know my impressions of any possible speedup. Do you suggest any specific compiler flags for the best performance of Meep?
In the meantime, let me ask a bit more about your suggestions:
(2) Is there some rule of thumb for choosing a decimation factor without impacting accuracy, or should I run tests for every different simulation?
(3) I am already using Meep 1.20; are the improvements to the step-curl updates active by default, or do I need to enable this feature by hand somehow?

@yugeniom
Author

yugeniom commented Aug 31, 2021

I spent significant time doing this and didn't see any noticeable gain in performance.

How far did you go with the number of nodes and processes? From the data in aws/aws-parallelcluster#1436, the possible speedup appears to grow with the number of processes involved. Consider that I will be trying to run on up to ~500 nodes for my needs. In your opinion, would some trial and error with the compiler flags used to build Meep be worthwhile?

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

Oh, thank you very much! I didn't know Meep also allowed for easy checkpointing; that's great! Can I also do that in the context of adjoint-method optimizations without issues?

This isn't quite ready for the adjoint simulation.

May I ask why? In any case, is some kind of automatic decimation already active by default? (I don't remember; I may have just read that somewhere here.) If so, should I disable it for adjoint-method optimizations?

@oskooi
Collaborator

oskooi commented Sep 1, 2021

Do you suggest any specific compiler flags for the best performance of Meep?

#1628 recently added support for multithreading via OpenMP for the fields update. You can therefore (in addition to --enable-single) also try compiling using --with-openmp and specifying the environment variable OMP_NUM_THREADS at runtime, e.g., env OMP_NUM_THREADS=2 mpirun -np 2 python foo.py. Also, you may want to experiment with compiling with --disable-portable-binary in order to let your compiler potentially take advantage of your native architecture; for more details on this flag see: https://meep.readthedocs.io/en/latest/Build_From_Source/#meep.

(2) Is there some rule of thumb for choosing a decimation factor without impacting accuracy, or should I run tests for every different simulation?

The optimal decimation factor which takes into account the band-limited nature of your (pulsed) source and monitor bandwidth is chosen for you automatically by default (#1732) except if you are using the adjoint solver (#1751). In that case and until #1751 is resolved, you will need to manually set the decimation_factor when defining the OptimizationProblem object. Some experimentation is required: too small a decimation_factor and you will likely not see much performance improvement whereas too large a value will cause aliasing and degrade the accuracy.
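
As an illustration (not from this thread), a minimal adjoint setup with a manually specified decimation_factor might look like the following sketch; the 2d geometry, source/monitor placement, and the value of 4 are placeholders, and only the decimation_factor keyword of OptimizationProblem is the point here:

```python
import meep as mp
import meep.adjoint as mpa
from autograd import numpy as npa

# placeholder 2d problem: all sizes, frequencies, and materials are illustrative
resolution = 20
fcen = 1 / 1.55
cell = mp.Vector3(6, 6)
pml_layers = [mp.PML(1.0)]

sources = [mp.EigenModeSource(mp.GaussianSource(frequency=fcen, fwidth=0.2 * fcen),
                              center=mp.Vector3(-2, 0),
                              size=mp.Vector3(0, 2),
                              eig_band=1)]

# design region parameterized by a MaterialGrid
design_variables = mp.MaterialGrid(mp.Vector3(20, 20), mp.air, mp.Medium(index=3.45))
design_region = mpa.DesignRegion(design_variables,
                                 volume=mp.Volume(center=mp.Vector3(),
                                                  size=mp.Vector3(1, 1)))

geometry = [mp.Block(center=design_region.center,
                     size=design_region.size,
                     material=design_variables)]

sim = mp.Simulation(resolution=resolution,
                    cell_size=cell,
                    boundary_layers=pml_layers,
                    sources=sources,
                    geometry=geometry)

# objective: power in the fundamental mode of an output monitor
mode_monitor = mpa.EigenmodeCoefficient(sim,
                                        mp.Volume(center=mp.Vector3(2, 0),
                                                  size=mp.Vector3(0, 2)),
                                        1)

def J(alpha):
    return npa.mean(npa.abs(alpha) ** 2)

opt = mpa.OptimizationProblem(simulation=sim,
                              objective_functions=[J],
                              objective_arguments=[mode_monitor],
                              design_regions=[design_region],
                              frequencies=[fcen],
                              decimation_factor=4)  # manual value; requires experimentation
```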

(3) I am already using Meep 1.20; are the improvements to the step-curl updates active by default, or do I need to enable this feature by hand somehow?

The loop tiling feature is part of the Meep 1.20 release. It is disabled by default. To use it, try setting loop_tile_base=10000 when defining the Simulation object. See the documentation for loop_tile_base which was recently added to https://meep.readthedocs.io/en/latest/Python_User_Interface/#simulation.
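
For illustration only (not from this thread), passing the tiling parameter mentioned above when constructing the Simulation could look like this sketch; the cell, resolution, source, and tile size of 10000 are placeholders:

```python
import meep as mp

# enable loop tiling for the step-curl updates via loop_tile_base
sim = mp.Simulation(cell_size=mp.Vector3(10, 10, 10),        # placeholder cell
                    resolution=20,                           # placeholder resolution
                    boundary_layers=[mp.PML(1.0)],
                    sources=[mp.Source(mp.GaussianSource(frequency=1.0, fwidth=0.2),
                                       component=mp.Ez,
                                       center=mp.Vector3())],
                    loop_tile_base=10000)                    # tile size; tune per problem/machine
```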

@yugeniom
Author

yugeniom commented Sep 8, 2021

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

Hi, I am having a look into this; is checkpointing somehow built-in, or do I have to set it up myself? If the latter, do you have any suggestions on how to proceed in the case of an adjoint optimization? Or maybe I should post this at https://github.com/stevengj/nlopt/issues ?

Thanks

@smartalecH
Collaborator

No, that's something you'll have to set up yourself.
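
Not from this thread, but a rough sketch of what such do-it-yourself checkpointing could look like when the optimization loop is driven by nlopt: save the latest design variables to disk inside the objective callback and reload them at the start of the next job submission. The file name, optimizer choice, and the placeholder objective are all illustrative; in practice the objective would call the Meep adjoint solver.

```python
import os
import numpy as np
import nlopt

CHECKPOINT = "design_checkpoint.npz"   # illustrative file name
n = 100                                # number of design variables (placeholder)

def objective(x, grad):
    # ... in a real run, evaluate the Meep adjoint problem here and fill grad ...
    f = float(np.sum(x))               # placeholder objective value
    if grad.size > 0:
        grad[:] = 1.0                  # placeholder gradient
    np.savez(CHECKPOINT, x=x)          # checkpoint the current design each iteration
    return f

# resume from the previous job's checkpoint if one exists
if os.path.exists(CHECKPOINT):
    x0 = np.load(CHECKPOINT)["x"]
else:
    x0 = 0.5 * np.ones(n)              # initial guess

opt = nlopt.opt(nlopt.LD_MMA, n)
opt.set_lower_bounds(0.0)
opt.set_upper_bounds(1.0)
opt.set_max_objective(objective)
opt.set_maxeval(10)                    # keep each job within the walltime limit
x_final = opt.optimize(x0)
```

Note that restarting this way resumes from the last saved design but resets the optimizer's internal state, so the iteration history across job submissions will not be identical to a single long run.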

@yugeniom
Author

yugeniom commented Sep 8, 2021

No, that's something you'll have to set up yourself.

I see; I'll give it a try. Do you foresee any possible issues in implementing it in the context of adjoint optimization with nlopt?

@stevengj
Collaborator

stevengj commented Sep 8, 2021

Issue seems resolved by this comment.

@stevengj stevengj closed this as completed Sep 8, 2021