
Commit

Enable single-precision floating point for DFT fields arrays (NanoComp#1675)

* enable single-precision floating point for DFT fields arrays

* update docs

* update benchmarking results in docs

* use modified DFT field update for real time-domain fields due to improved performance using clang
oskooi authored and mawc2019 committed Nov 3, 2021
1 parent 28460a1 commit 533dc5f
Showing 4 changed files with 17 additions and 15 deletions.
6 changes: 3 additions & 3 deletions doc/docs/Build_From_Source.md
@@ -246,14 +246,14 @@ This flag enables some experimental support for [OpenMP](https://en.wikipedia.or

### Floating-Point Precision of the Fields and Materials Arrays

-By default, the C/C++ arrays used in Meep to store the time-domain fields ($\mathbf{E}$, $\mathbf{D}$, $\mathbf{H}$) and materials ($\varepsilon$) are defined using [double-precision floating point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Updating the fields arrays generally dominates the computational cost of the simulation because it occurs at every voxel in the cell and at every timestep. Because [discretization errors](https://en.wikipedia.org/wiki/Discretization_error) which include the discontinuous material interfaces (as described in [Subpixel Smoothing](Subpixel_Smoothing.md)) as well as the [numerical dispersion](https://en.wikipedia.org/wiki/Numerical_dispersion) of the Yee grid typically dominates the [floating-point roundoff error](https://en.wikipedia.org/wiki/Round-off_error), the fields/materials arrays can be defined using [single-precision floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) to provide a significant speedup (by reducing the [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth)) often without *any loss* in simulation accuracy.
+By default, the C++ arrays used in Meep to store the time-domain fields ($\mathbf{E}$, $\mathbf{D}$, $\mathbf{H}$, $\mathbf{B}$) and materials ($\varepsilon$, $\mu$) are defined using [double-precision floating point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Updating the fields arrays generally dominates the computational cost of the simulation because it occurs at every voxel in the cell and at every timestep. Because [discretization errors](https://en.wikipedia.org/wiki/Discretization_error), which include the [discontinuous material interfaces](Subpixel_Smoothing.md) as well as the [numerical dispersion](https://en.wikipedia.org/wiki/Numerical_dispersion) of the Yee grid, typically dominate the [floating-point roundoff error](https://en.wikipedia.org/wiki/Round-off_error), the fields and materials arrays can be defined using [single-precision floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) to provide a significant speedup, by reducing the [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth), often without any loss in simulation accuracy.

This feature requires one to pass the `--enable-single` flag to
the Meep `configure` script before compiling.

-As a demonstration of the potential improvement in runtime performance, for an experiment involving [computing the light-extraction efficiency of an OLED](http://www.simpetus.com/projects.html#meep_oled) which includes PMLs, DFT flux monitors, and Lorentzian susceptibilities, the timestepping rate (s/step) for the single-precision case using 20 MPI processes was ~50% that of double precision.
+As a demonstration of the potential improvement in runtime performance, for a benchmarking experiment based on [computing the light-extraction efficiency of an OLED](https://gist.github.com/oskooi/f745c467ae54e192d5c8340eace9a780) which includes PMLs, DFT flux monitors, and Lorentzian susceptibilities (for material dispersion), the timestepping rate (s/step) for the single-precision case using 20 MPI processes was *less than half* that of double precision.

-The DFT fields, however, are always defined using double-precision floating point. This is intended to mitigate the accumulation of round-off error for simulations with a large number of timesteps.
+Compiling with `--enable-single` will also change the type of the DFT fields arrays from double- to single-precision floating point. Reducing the floating-point precision generally does not affect the DFT accuracy and can provide a speedup to the DFT fields updates, particularly when the simulation involves large DFT monitors with a fine frequency mesh, which is typical for the [adjoint solver](Python_Tutorials/Adjoint_Solver.md). As an example, for the same OLED benchmarking test using a single thread, the time spent on the DFT fields updates was *nearly halved* when switching from double to single precision.

### Separating Build and Source Paths

18 changes: 9 additions & 9 deletions src/dft.cpp
@@ -97,11 +97,11 @@ dft_chunk::dft_chunk(fields_chunk *fc_, ivec is_, ivec ie_, vec s0_, vec s1_, ve

const int Nomega = data->omega.size();
omega = data->omega;
-  dft_phase = new complex<double>[Nomega];
+  dft_phase = new complex<realnum>[Nomega];

N = 1;
LOOP_OVER_DIRECTIONS(is.dim, d) { N *= (ie.in_direction(d) - is.in_direction(d)) / 2 + 1; }
-  dft = new complex<double>[N * Nomega];
+  dft = new complex<realnum>[N * Nomega];
for (size_t i = 0; i < N * Nomega; ++i)
dft[i] = 0.0;
for (int i = 0; i < 5; ++i)
@@ -238,7 +238,7 @@ void dft_chunk::update_dft(double time) {
}
else
w = 1.0;
-    double f[2]; // real/imag field value at epsilon point
+    realnum f[2]; // real/imag field value at epsilon point
if (avg2)
for (int cmp = 0; cmp < numcmp; ++cmp)
f[cmp] = (w * 0.25) * (fc->f[c][cmp][idx] + fc->f[c][cmp][idx + avg1] +
@@ -251,14 +251,14 @@
f[cmp] = w * fc->f[c][cmp][idx];

if (numcmp == 2) {
-      complex<double> fc(f[0], f[1]);
+      complex<realnum> fc(f[0], f[1]);
for (int i = 0; i < Nomega; ++i)
dft[Nomega * idx_dft + i] += dft_phase[i] * fc;
}
else {
-      double fr = f[0];
+      realnum fr = f[0];
for (int i = 0; i < Nomega; ++i)
-        dft[Nomega * idx_dft + i] += dft_phase[i] * fr;
+        dft[Nomega * idx_dft + i] += std::complex<realnum>{fr * dft_phase[i].real(), fr * dft_phase[i].imag()};
}
idx_dft++;
}
@@ -305,7 +305,7 @@ void save_dft_hdf5(dft_chunk *dft_chunks, const char *name, h5file *file, const

for (dft_chunk *cur = dft_chunks; cur; cur = cur->next_in_dft) {
size_t Nchunk = cur->N * cur->omega.size() * 2;
-    file->write_chunk(1, &istart, &Nchunk, (double *)cur->dft);
+    file->write_chunk(1, &istart, &Nchunk, (realnum *)cur->dft);
istart += Nchunk;
}
file->done_writing_chunks();
@@ -333,7 +333,7 @@ void load_dft_hdf5(dft_chunk *dft_chunks, const char *name, h5file *file, const

for (dft_chunk *cur = dft_chunks; cur; cur = cur->next_in_dft) {
size_t Nchunk = cur->N * cur->omega.size() * 2;
-    file->read_chunk(1, &istart, &Nchunk, (double *)cur->dft);
+    file->read_chunk(1, &istart, &Nchunk, (realnum *)cur->dft);
istart += Nchunk;
}
}
@@ -823,7 +823,7 @@ complex<double> dft_chunk::process_dft_component(int rank, direction *ds, ivec m
? parent->get_eps(loc)
: c_conjugate == Permeability
? parent->get_mu(loc)
-                   : dft[omega.size() * (chunk_idx++) + num_freq] / stored_weight);
+                   : complex<double>(dft[omega.size() * (chunk_idx++) + num_freq]) / stored_weight);
if (include_dV_and_interp_weights) dft_val /= (sqrt_dV_and_interp_weights ? sqrt(w) : w);

complex<double> mode1val = 0.0, mode2val = 0.0;
4 changes: 2 additions & 2 deletions src/meep.hpp
@@ -1117,7 +1117,7 @@ class dft_chunk {
component c; // component to DFT (possibly transformed by symmetry)

size_t N; // number of spatial points (on epsilon grid)
-  std::complex<double> *dft; // N x Nomega array of DFT values.
+  std::complex<realnum> *dft; // N x Nomega array of DFT values.

class dft_chunk *next_in_chunk; // per-fields_chunk list of DFT chunks
class dft_chunk *next_in_dft; // next for this particular DFT vol./component
@@ -1168,7 +1168,7 @@ class dft_chunk {
int sn;

// cache of exp(iwt) * scale, of length Nomega
-  std::complex<double> *dft_phase;
+  std::complex<realnum> *dft_phase;

ptrdiff_t avg1, avg2; // index offsets for average to get epsilon grid

4 changes: 3 additions & 1 deletion src/stress.cpp
@@ -93,7 +93,9 @@ static void stress_sum(size_t Nfreq, double *F, const dft_chunk *F1, const dft_c
complex<double> extra_weight(real(curF1->extra_weight), imag(curF1->extra_weight));
for (size_t k = 0; k < curF1->N; ++k)
for (size_t i = 0; i < Nfreq; ++i)
-        F[i] += real(extra_weight * curF1->dft[k * Nfreq + i] * conj(curF2->dft[k * Nfreq + i]));
+        F[i] += real(extra_weight *
+                     complex<double>(curF1->dft[k * Nfreq + i]) *
+                     conj(complex<double>(curF2->dft[k * Nfreq + i])));
}
}

