Intel libraries #1750

Closed
yugeniom opened this issue Aug 31, 2021 · 9 comments

Comments

@yugeniom

Hello there,
So far I have done some timing tests of my Meep topology optimization code, using just 4 nodes with the MPICH implementation of MPI included in the Conda package. My projected time to reach enough iterations is well beyond the maximum walltime-per-job allowed on my machine. I have read that the Intel MPI implementation can offer a good speedup over many nodes; has anyone ever tried compiling pymeep from source against it? If so, what is the newest Meep version you have gotten to work with it?

@oskooi
Collaborator

oskooi commented Aug 31, 2021

For a significant speedup in the timestepping rate when using the adjoint solver for inverse design and topology optimization, you can try the three new features added in Meep 1.20: (1) single-precision floating point for the fields arrays, (2) decimation of the DFT field updates, and (3) memory locality for the step-curl updates via loop tiling. Since the Meep Conda package for the 1.20 release is built using double-precision floating point, to use (1) you will need to compile Meep from source using the --enable-single flag. Benchmarking tests on AWS EC2 have demonstrated that these combined features can provide a speedup of more than 5X in the timestepping rate for large 3d problems.

Separately, it would still be useful to investigate the performance impact of the various MPI libraries for single and multi-node MPI clusters.

@smartalecH
Collaborator

I have read that the Intel MPI implementation can offer a good speedup over many nodes

I spent significant time doing this and didn't see any noticeable gain in performance.

My projected time to reach enough iterations is well beyond the maximum walltime-per-job allowed on my machine

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

(2) decimation of the DFT field update

This isn't quite ready for the adjoint simulation.

@yugeniom
Author

yugeniom commented Aug 31, 2021

For a significant speedup in the timestepping rate when using the adjoint solver for inverse design and topology optimization, you can try the three new features added in Meep 1.20: (1) single-precision floating point for the fields arrays, (2) decimation of the DFT field updates, and (3) memory locality for the step-curl updates via loop tiling. […]

Hi,
Thank you for the suggestions. I'll definitely try to compile Meep with Intel MPI as soon as I get access to a machine licensed for it, and I'll let you know my impressions of any possible speedup. Do you suggest any specific compiler flags for the best performance of Meep?
In the meantime, let me ask a bit more about your suggestions:
(2) Is there some rule of thumb for choosing a decimation factor without impacting accuracy, or should I run tests for every different simulation?
(3) I am already using Meep 1.20; are the improvements to the step-curl updates active by default, or do I need to enable this feature by hand somehow?

@yugeniom
Author

yugeniom commented Aug 31, 2021

I spent significant time doing this and didn't see any noticeable gain in performance.

How far did you go with the number of nodes and processes? From the data in aws/aws-parallelcluster#1436, the possible speedup appears to grow with the number of processes involved. Consider that I will be trying to run on up to ~500 nodes for my needs. In your opinion, would some trial and error with the compiler flags used to build Meep be worthwhile?

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

Oh, thank you very much! I didn't know Meep also allowed for easy checkpointing; that's great! Can I also do that in the context of adjoint-method optimizations without issues?

This isn't quite ready for the adjoint simulation.

May I ask why? In any case, is some kind of automatic decimation already active by default? (I don't remember; I may have just read that somewhere here.) If so, should I disable it for adjoint-method optimizations?

@oskooi
Collaborator

oskooi commented Sep 1, 2021

Do you suggest any specific compiler flags for the best performance of Meep?

#1628 recently added support for multithreading via OpenMP for the fields update. You can therefore (in addition to --enable-single) also try compiling using --with-openmp and specifying the environment variable OMP_NUM_THREADS at runtime, e.g., env OMP_NUM_THREADS=2 mpirun -np 2 python foo.py. Also, you may want to experiment with compiling with --disable-portable-binary in order to let your compiler potentially take advantage of your native architecture; for more details on this flag see: https://meep.readthedocs.io/en/latest/Build_From_Source/#meep.

(2) Is there some rule of thumb for choosing a decimation factor without impacting accuracy, or should I run tests for every different simulation?

The optimal decimation factor which takes into account the band-limited nature of your (pulsed) source and monitor bandwidth is chosen for you automatically by default (#1732) except if you are using the adjoint solver (#1751). In that case and until #1751 is resolved, you will need to manually set the decimation_factor when defining the OptimizationProblem object. Some experimentation is required: too small a decimation_factor and you will likely not see much performance improvement whereas too large a value will cause aliasing and degrade the accuracy.
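
As an illustration (not from this thread), a minimal adjoint setup with a manually specified decimation_factor might look like the following sketch; the 2d geometry, source/monitor placement, and the value of 4 are placeholders, and only the decimation_factor keyword of OptimizationProblem is the point here:

```python
import meep as mp
import meep.adjoint as mpa
from autograd import numpy as npa

# placeholder 2d problem: all sizes, frequencies, and materials are illustrative
resolution = 20
fcen = 1 / 1.55
cell = mp.Vector3(6, 6)
pml_layers = [mp.PML(1.0)]

sources = [mp.EigenModeSource(mp.GaussianSource(frequency=fcen, fwidth=0.2 * fcen),
                              center=mp.Vector3(-2, 0),
                              size=mp.Vector3(0, 2),
                              eig_band=1)]

# design region parameterized by a MaterialGrid
design_variables = mp.MaterialGrid(mp.Vector3(20, 20), mp.air, mp.Medium(index=3.45))
design_region = mpa.DesignRegion(design_variables,
                                 volume=mp.Volume(center=mp.Vector3(),
                                                  size=mp.Vector3(1, 1)))

geometry = [mp.Block(center=design_region.center,
                     size=design_region.size,
                     material=design_variables)]

sim = mp.Simulation(resolution=resolution,
                    cell_size=cell,
                    boundary_layers=pml_layers,
                    sources=sources,
                    geometry=geometry)

# objective: power in the fundamental mode of an output monitor
mode_monitor = mpa.EigenmodeCoefficient(sim,
                                        mp.Volume(center=mp.Vector3(2, 0),
                                                  size=mp.Vector3(0, 2)),
                                        1)

def J(alpha):
    return npa.mean(npa.abs(alpha) ** 2)

opt = mpa.OptimizationProblem(simulation=sim,
                              objective_functions=[J],
                              objective_arguments=[mode_monitor],
                              design_regions=[design_region],
                              frequencies=[fcen],
                              decimation_factor=4)  # manual value; requires experimentation
```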

(3) I am already using Meep 1.20; are the improvements to the step-curl updates active by default, or do I need to enable this feature by hand somehow?

The loop tiling feature is part of the Meep 1.20 release. It is disabled by default. To use it, try setting loop_tile_base=10000 when defining the Simulation object. See the documentation for loop_tile_base which was recently added to https://meep.readthedocs.io/en/latest/Python_User_Interface/#simulation.
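
For illustration only (not from this thread), passing the tiling parameter mentioned above when constructing the Simulation could look like this sketch; the cell, resolution, source, and tile size of 10000 are placeholders:

```python
import meep as mp

# enable loop tiling for the step-curl updates via loop_tile_base
sim = mp.Simulation(cell_size=mp.Vector3(10, 10, 10),        # placeholder cell
                    resolution=20,                           # placeholder resolution
                    boundary_layers=[mp.PML(1.0)],
                    sources=[mp.Source(mp.GaussianSource(frequency=1.0, fwidth=0.2),
                                       component=mp.Ez,
                                       center=mp.Vector3())],
                    loop_tile_base=10000)                    # tile size; tune per problem/machine
```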

@yugeniom
Author

yugeniom commented Sep 8, 2021

Maybe try checkpointing your optimizations so you can resume where you left off across multiple job submissions.

Hi, I am having a look into this; is checkpointing somehow built-in, or do I have to set it up myself? If the latter, do you have any suggestions on how to proceed in the case of an adjoint optimization? Or maybe I should post this at https://github.com/stevengj/nlopt/issues ?

Thanks

@smartalecH
Collaborator

No, that's something you'll have to set up yourself.
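
Not from this thread, but a rough sketch of what such do-it-yourself checkpointing could look like when the optimization loop is driven by nlopt: save the latest design variables to disk inside the objective callback and reload them at the start of the next job submission. The file name, optimizer choice, and the placeholder objective are all illustrative; in practice the objective would call the Meep adjoint solver.

```python
import os
import numpy as np
import nlopt

CHECKPOINT = "design_checkpoint.npz"   # illustrative file name
n = 100                                # number of design variables (placeholder)

def objective(x, grad):
    # ... in a real run, evaluate the Meep adjoint problem here and fill grad ...
    f = float(np.sum(x))               # placeholder objective value
    if grad.size > 0:
        grad[:] = 1.0                  # placeholder gradient
    np.savez(CHECKPOINT, x=x)          # checkpoint the current design each iteration
    return f

# resume from the previous job's checkpoint if one exists
if os.path.exists(CHECKPOINT):
    x0 = np.load(CHECKPOINT)["x"]
else:
    x0 = 0.5 * np.ones(n)              # initial guess

opt = nlopt.opt(nlopt.LD_MMA, n)
opt.set_lower_bounds(0.0)
opt.set_upper_bounds(1.0)
opt.set_max_objective(objective)
opt.set_maxeval(10)                    # keep each job within the walltime limit
x_final = opt.optimize(x0)
```

Note that restarting this way resumes from the last saved design but resets the optimizer's internal state, so the iteration history across job submissions will not be identical to a single long run.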

@yugeniom
Author

yugeniom commented Sep 8, 2021

No, that's something you'll have to set up yourself.

I see; I'll give it a try. Do you foresee any possible issues in implementing it in the context of adjoint optimization with nlopt?

@stevengj
Collaborator

stevengj commented Sep 8, 2021

Issue seems resolved by this comment.

@stevengj stevengj closed this as completed Sep 8, 2021