Enable single-precision floating point for DFT fields arrays #1675

Merged: 4 commits merged into NanoComp:master from dft_realnum on Jul 19, 2021

Collaborator

@oskooi oskooi commented Jul 15, 2021

#1544 enabled single-precision floating point for the time-domain fields. However, that PR did not change the DFT fields, which are always stored in double precision. The DFT field updates are often the performance bottleneck in the timestepping for the adjoint solver because the entire design region is a DFT fields monitor, typically with a fine frequency mesh (i.e., a large number of spatial and frequency points that must be updated at every timestep).
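To make the cost concrete, here is a minimal sketch of a running DFT accumulation over a set of monitor points. It is not the code in dft.cpp: the function name `update_dft`, the phase convention, and the `dt` scaling are assumptions chosen for illustration, and Meep's actual sign and normalization conventions may differ. The point is only that every (point, frequency) pair is read, modified, and written back on every timestep.

```python
import cmath
import math


def update_dft(dft, field, freqs, t, dt):
    """Add one timestep's contribution to a running DFT (illustrative only)."""
    # One complex phase factor per frequency, shared by every spatial point.
    phases = [dt * cmath.exp(2j * math.pi * f * t) for f in freqs]
    for p, value in enumerate(field):
        row = dft[p]
        for k, phase in enumerate(phases):
            # Read-modify-write of one stored complex number.
            row[k] += phase * value


# Hypothetical toy problem: 4 monitor points, 2 frequencies, 2 timesteps.
num_points, freqs = 4, [0.0, 0.25]
dft = [[0j] * len(freqs) for _ in range(num_points)]
for step in range(2):
    update_dft(dft, field=[1.0] * num_points, freqs=freqs, t=step * 0.5, dt=0.5)
```

The inner loop does very little arithmetic per stored value, so the update is dominated by memory traffic, and halving the size of each stored number roughly halves the data that must be moved.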

To reduce the memory bandwidth even further than #1544 did, this PR switches the default type of the DFT fields arrays to single precision when compiling with the --enable-single flag. It only modifies the DFT field updates in dft.cpp and leaves other functions that use the DFT fields (process_dft_component, get_dft_arrays, etc.) unchanged because they are not performance critical. When running the test suite via make check, 6 of the 19 C++ unit tests fail because of slight differences from the hard-coded reference values, which is expected.

The performance improvement enabled by this PR for a benchmarking test involving an OLED device with multiple DFT monitors is significant (see this gist showing simulation script and results). The time spent on the DFT field updates was nearly halved (corrected from the initially reported reduction by more than a factor of 12) when switching from double to single precision, with practically no loss in the accuracy of the flux values.

@stevengj
Collaborator

The factor of 12 seems hard to believe. One hypothesis is that you are getting lucky and single precision is just fitting into cache — an easy way to check this would be to double the number of frequencies.
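The cache hypothesis can be checked on paper before rerunning anything. The sketch below estimates the size of the DFT arrays for each precision and compares it with a cache size. All of the numbers (point count, frequency count, cache size) are hypothetical and are not taken from the benchmark; the helper name `dft_working_set_bytes` is also invented here.

```python
def dft_working_set_bytes(num_points, num_freqs, bytes_per_real):
    """Bytes needed to hold one complex value per (point, frequency) pair."""
    return num_points * num_freqs * 2 * bytes_per_real


# All numbers below are hypothetical, chosen only to illustrate the check.
CACHE_BYTES = 8 * 1024 * 1024
NUM_POINTS, NUM_FREQS = 12_000, 50

for label, num_freqs in (("original", NUM_FREQS), ("doubled", 2 * NUM_FREQS)):
    single = dft_working_set_bytes(NUM_POINTS, num_freqs, 4)
    double = dft_working_set_bytes(NUM_POINTS, num_freqs, 8)
    print(label, "single fits:", single <= CACHE_BYTES,
          "double fits:", double <= CACHE_BYTES)
```

With these made-up sizes, only the single-precision arrays fit at the original frequency count, and neither fits once the frequencies are doubled. If a speedup far above 2X came from such a cache effect, it should shrink toward 2X after doubling.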

@ahoenselaar
Contributor

We should rerun the performance comparisons with monitors on the Yee grid.

@stevengj
Collaborator

I think it is fine to just change the performance-critical arrays here (the ones that are updated on every timestep).

@ahoenselaar
Contributor

Additional changes required here: https://github.com/NanoComp/meep/blob/master/python/meep.i#L421

@oskooi
Collaborator Author

oskooi commented Jul 16, 2021

Contrary to the earlier suggestion by @ahoenselaar, no changes to get_dft_array, etc. seem to be required in this PR because nothing is broken by these changes. Five Python tests call the get_dft_array function: test_adjoint_solver.py, test_array_metadata.py, test_dft_fields.py, test_gaussianbeam.py, and test_n2f_periodic.py. Three of these (test_dft_fields.py, test_gaussianbeam.py, test_n2f_periodic.py) pass using this branch compiled with --enable-single. The two failing tests (test_adjoint_solver.py and test_array_metadata.py) fail because of slight numerical differences in the fields, not because of anything related to get_dft_array. In fact, all 21 of the 49 Python tests that fail (listed below) do so because of slight numerical differences in the fields, similar to the failing C++ tests.

This means that this PR can probably be merged as-is without any additional changes.

FAIL: tests/test_3rd_harm_1d.py
PASS: tests/test_absorber_1d.py
FAIL: tests/test_adjoint_solver.py
FAIL: tests/test_adjoint_jax.py
PASS: tests/test_antenna_radiation.py
PASS: tests/test_array_metadata.py
PASS: tests/test_bend_flux.py
PASS: tests/test_binary_grating.py
FAIL: tests/test_cavity_arrayslice.py
FAIL: tests/test_cavity_farfield.py
PASS: tests/test_chunk_layout.py
FAIL: tests/test_chunks.py
PASS: tests/test_cyl_ellipsoid.py
PASS: tests/test_dft_energy.py
PASS: tests/test_dft_fields.py
PASS: tests/test_diffracted_planewave.py
FAIL: tests/test_dispersive_eigenmode.py
PASS: tests/test_divide_mpi_processes.py
FAIL: tests/test_eigfreq.py
PASS: tests/test_faraday_rotation.py
FAIL: tests/test_field_functions.py
PASS: tests/test_force.py
PASS: tests/test_fragment_stats.py
PASS: tests/test_gaussianbeam.py
PASS: tests/test_geom.py
FAIL: tests/test_get_point.py
FAIL: tests/test_holey_wvg_bands.py
FAIL: tests/test_holey_wvg_cavity.py
PASS: tests/test_kdom.py
PASS: tests/test_ldos.py
PASS: tests/test_material_grid.py
PASS: tests/test_medium_evaluations.py
FAIL: tests/test_mode_coeffs.py
PASS: tests/test_mode_decomposition.py
FAIL: tests/test_multilevel_atom.py
PASS: tests/test_n2f_periodic.py
PASS: tests/test_oblique_source.py
PASS: tests/test_physical.py
PASS: tests/test_prism.py
FAIL: tests/test_pw_source.py
FAIL: tests/test_refl_angular.py
FAIL: tests/test_ring.py
FAIL: tests/test_ring_cyl.py
FAIL: tests/test_simulation.py
PASS: tests/test_special_kz.py
PASS: tests/test_source.py
FAIL: tests/test_user_defined_material.py
PASS: tests/test_visualization.py
FAIL: tests/test_wvg_src.py
============================================================================
Testsuite summary for meep 1.20.0-beta
============================================================================
# TOTAL: 49
# PASS:  28
# SKIP:  0
# XFAIL: 0
# FAIL:  21
# XPASS: 0
# ERROR: 0
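Failures of this kind typically come from comparing against reference values with a tolerance tuned for double precision. As a rough illustration (not something this PR implements), a test could pick its tolerance from the build's precision. The helper names and the tolerance values below are hypothetical, and single precision is emulated by round-tripping a Python float through a 32-bit `struct` format.

```python
import math
import struct


def to_single(x):
    """Round a Python float (double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matches_reference(value, reference, single_precision):
    """Compare against a hard-coded value with a precision-aware tolerance."""
    # Hypothetical tolerances; a real test suite would tune these per test.
    rel_tol = 1e-5 if single_precision else 1e-12
    return math.isclose(value, reference, rel_tol=rel_tol)


reference = 0.1                 # hard-coded double-precision value
computed = to_single(0.1)       # the same quantity after single-precision rounding

print(matches_reference(computed, reference, single_precision=False))
print(matches_reference(computed, reference, single_precision=True))
```

The single-precision result misses the tight tolerance but passes the looser one, which matches the pattern of failures that reflect only slight numerical differences.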

@ahoenselaar
Contributor

Ah yes! The conversion from realnum to double occurs in line 820 in dft.cpp, before any of the routines in the SWIG wrapper get exposure to it.

@oskooi
Collaborator Author

oskooi commented Jul 16, 2021

> Ah yes! The conversion from realnum to double occurs in line 820 in dft.cpp, before any of the routines in the SWIG wrapper get exposure to it.

That's correct. get_dft_array therefore always returns its result as double-precision floating point regardless of the type of the actual DFT fields. The key point is that this is not performance critical as get_dft_array is typically not called at every timestep. It would be good to fix this at some point but this can be addressed in a separate PR.
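A small sketch of what that widening step implies, using Python's `array` module as a stand-in for the C++ arrays (the variable names are invented and this is not the code in dft.cpp): converting single-precision storage to double on output changes the type the caller sees, but it cannot restore digits that were already rounded away.

```python
from array import array

# Stand-in for DFT data stored as single precision ('f' = C float).
stored = array("f", [0.1, 1.0 / 3.0])

# Widening copy handed back to the caller ('d' = C double).
returned = array("d", stored)

print(stored.itemsize, returned.itemsize)
print(returned[0] == stored[0])   # widening is exact
print(returned[0] == 0.1)         # but the float32 rounding error remains
```

The caller therefore always gets doubles whose values carry single-precision accuracy when the library is built with --enable-single.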

@oskooi
Collaborator Author

oskooi commented Jul 17, 2021

Following the suggestion from @ahoenselaar, I reran the benchmark using the DFT fields with yee_grid=True (rather than DFT flux), with both gcc and clang, on the same single-core 4.2 GHz Intel Kaby Lake machine. For this test configuration (see gist), the time spent on the DFT fields updates in single precision was nearly half that of double precision, as expected (single: 0.0140958 ± 0.0003646 s, double: 0.0237407 ± 0.0006847 s). The results were similar with yee_grid=False and independent of the choice of compiler. I have updated the documentation with these results.
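For reference, the ratio implied by the two timings above can be worked out directly. The error estimate below assumes the two uncertainties are independent and combines the relative errors in quadrature; that assumption is not stated in the benchmark itself.

```python
import math

# Timings reported above, in seconds (mean and uncertainty).
single_t, single_err = 0.0140958, 0.0003646
double_t, double_err = 0.0237407, 0.0006847

speedup = double_t / single_t
# Assumes independent uncertainties: relative errors add in quadrature.
rel_err = math.hypot(single_err / single_t, double_err / double_t)
speedup_err = speedup * rel_err

print(f"speedup = {speedup:.2f} +/- {speedup_err:.2f}")
print(f"single/double time ratio = {single_t / double_t:.2f}")
```

This gives a speedup of roughly 1.7X, meaning the single-precision update takes about 59% of the double-precision time, somewhat short of the ideal factor of 2.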

As additional verification, I reran the original benchmarking test with the DFT flux reported in the initial comment. This time I measured only the expected speedup of ~2X for single precision, not the ~10X initially reported, which is reassuring. The discrepancy arose because my original comment compared the single-precision results from this branch against the master branch compiled with --enable-debug, which turns off optimization and therefore produces much slower results.

@stevengj stevengj merged commit b5f6cb7 into NanoComp:master Jul 19, 2021
@stevengj
Collaborator

LGTM, thanks.

@oskooi oskooi deleted the dft_realnum branch July 19, 2021 20:29
bencbartlett pushed a commit to bencbartlett/meep that referenced this pull request Sep 9, 2021
…p#1675)

* enable single-precision floating point for DFT fields arrays

* update docs

* update benchmarking results in docs

* use modified DFT field update for real time-domain fields due to improved performance using clang
mawc2019 pushed a commit to mawc2019/meep that referenced this pull request Nov 3, 2021
…p#1675)

* enable single-precision floating point for DFT fields arrays

* update docs

* update benchmarking results in docs

* use modified DFT field update for real time-domain fields due to improved performance using clang
3 participants