
Commit

Enable single-precision floating point for DFT fields arrays (NanoComp#1675)

* enable single-precision floating point for DFT fields arrays

* update docs

* update benchmarking results in docs

* use modified DFT field update for real time-domain fields due to improved performance using clang
oskooi authored and mawc2019 committed Nov 3, 2021
1 parent 28460a1 commit 533dc5f
Showing 4 changed files with 17 additions and 15 deletions.
6 changes: 3 additions & 3 deletions doc/docs/Build_From_Source.md
@@ -246,14 +246,14 @@ This flag enables some experimental support for [OpenMP](https://en.wikipedia.or

### Floating-Point Precision of the Fields and Materials Arrays

-By default, the C/C++ arrays used in Meep to store the time-domain fields ($\mathbf{E}$, $\mathbf{D}$, $\mathbf{H}$) and materials ($\varepsilon$) are defined using [double-precision floating point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Updating the fields arrays generally dominates the computational cost of the simulation because it occurs at every voxel in the cell and at every timestep. Because [discretization errors](https://en.wikipedia.org/wiki/Discretization_error) which include the discontinuous material interfaces (as described in [Subpixel Smoothing](Subpixel_Smoothing.md)) as well as the [numerical dispersion](https://en.wikipedia.org/wiki/Numerical_dispersion) of the Yee grid typically dominates the [floating-point roundoff error](https://en.wikipedia.org/wiki/Round-off_error), the fields/materials arrays can be defined using [single-precision floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) to provide a significant speedup (by reducing the [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth)) often without *any loss* in simulation accuracy.
+By default, the C++ arrays used in Meep to store the time-domain fields ($\mathbf{E}$, $\mathbf{D}$, $\mathbf{H}$, $\mathbf{B}$) and materials ($\varepsilon$, $\mu$) are defined using [double-precision floating point](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Updating the fields arrays generally dominates the computational cost of the simulation because it occurs at every voxel in the cell and at every timestep. Because [discretization errors](https://en.wikipedia.org/wiki/Discretization_error), which include the [discontinuous material interfaces](Subpixel_Smoothing.md) as well as the [numerical dispersion](https://en.wikipedia.org/wiki/Numerical_dispersion) of the Yee grid, typically dominate the [floating-point roundoff error](https://en.wikipedia.org/wiki/Round-off_error), the fields and materials arrays can be defined using [single-precision floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) to provide a significant speedup, by reducing the [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth), often without any loss in simulation accuracy.

This feature requires one to pass the `--enable-single` flag to
the Meep `configure` script before compiling.

-As a demonstration of the potential improvement in runtime performance, for an experiment involving [computing the light-extraction efficiency of an OLED](http://www.simpetus.com/projects.html#meep_oled) which includes PMLs, DFT flux monitors, and Lorentzian susceptibilities, the timestepping rate (s/step) for the single-precision case using 20 MPI processes was ~50% that of double precision.
+As a demonstration of the potential improvement in runtime performance, for a benchmarking experiment based on [computing the light-extraction efficiency of an OLED](https://gist.github.com/oskooi/f745c467ae54e192d5c8340eace9a780) which includes PMLs, DFT flux monitors, and Lorentzian susceptibilities (for material dispersion), the timestepping rate (s/step) for the single-precision case using 20 MPI processes was *less than half* that of double precision.

-The DFT fields, however, are always defined using double-precision floating point. This is intended to mitigate the accumulation of round-off error for simulations with a large number of timesteps.
+Compiling with `--enable-single` will also change the type of the DFT fields arrays from double- to single-precision floating point. Reducing the floating-point precision generally does not affect the DFT accuracy and can provide a speedup to the DFT fields updates, particularly when the simulation involves large DFT monitors with a fine frequency mesh, which is typical for the [adjoint solver](Python_Tutorials/Adjoint_Solver.md). As an example, for the same OLED benchmarking test using a single thread, the time spent on the DFT fields updates was *nearly halved* when switching from double to single precision.

### Separating Build and Source Paths

18 changes: 9 additions & 9 deletions src/dft.cpp
@@ -97,11 +97,11 @@ dft_chunk::dft_chunk(fields_chunk *fc_, ivec is_, ivec ie_, vec s0_, vec s1_, ve

const int Nomega = data->omega.size();
omega = data->omega;
-  dft_phase = new complex<double>[Nomega];
+  dft_phase = new complex<realnum>[Nomega];

N = 1;
LOOP_OVER_DIRECTIONS(is.dim, d) { N *= (ie.in_direction(d) - is.in_direction(d)) / 2 + 1; }
-  dft = new complex<double>[N * Nomega];
+  dft = new complex<realnum>[N * Nomega];
for (size_t i = 0; i < N * Nomega; ++i)
dft[i] = 0.0;
for (int i = 0; i < 5; ++i)
@@ -238,7 +238,7 @@ void dft_chunk::update_dft(double time) {
}
else
w = 1.0;
-    double f[2]; // real/imag field value at epsilon point
+    realnum f[2]; // real/imag field value at epsilon point
if (avg2)
for (int cmp = 0; cmp < numcmp; ++cmp)
f[cmp] = (w * 0.25) * (fc->f[c][cmp][idx] + fc->f[c][cmp][idx + avg1] +
@@ -251,14 +251,14 @@
f[cmp] = w * fc->f[c][cmp][idx];

if (numcmp == 2) {
-      complex<double> fc(f[0], f[1]);
+      complex<realnum> fc(f[0], f[1]);
for (int i = 0; i < Nomega; ++i)
dft[Nomega * idx_dft + i] += dft_phase[i] * fc;
}
else {
-      double fr = f[0];
+      realnum fr = f[0];
for (int i = 0; i < Nomega; ++i)
-        dft[Nomega * idx_dft + i] += dft_phase[i] * fr;
+        dft[Nomega * idx_dft + i] += std::complex<realnum>{fr * dft_phase[i].real(), fr * dft_phase[i].imag()};
}
idx_dft++;
}
@@ -305,7 +305,7 @@ void save_dft_hdf5(dft_chunk *dft_chunks, const char *name, h5file *file, const

for (dft_chunk *cur = dft_chunks; cur; cur = cur->next_in_dft) {
size_t Nchunk = cur->N * cur->omega.size() * 2;
-    file->write_chunk(1, &istart, &Nchunk, (double *)cur->dft);
+    file->write_chunk(1, &istart, &Nchunk, (realnum *)cur->dft);
istart += Nchunk;
}
file->done_writing_chunks();
@@ -333,7 +333,7 @@ void load_dft_hdf5(dft_chunk *dft_chunks, const char *name, h5file *file, const

for (dft_chunk *cur = dft_chunks; cur; cur = cur->next_in_dft) {
size_t Nchunk = cur->N * cur->omega.size() * 2;
-    file->read_chunk(1, &istart, &Nchunk, (double *)cur->dft);
+    file->read_chunk(1, &istart, &Nchunk, (realnum *)cur->dft);
istart += Nchunk;
}
}
@@ -823,7 +823,7 @@ complex<double> dft_chunk::process_dft_component(int rank, direction *ds, ivec m
? parent->get_eps(loc)
: c_conjugate == Permeability
? parent->get_mu(loc)
-                   : dft[omega.size() * (chunk_idx++) + num_freq] / stored_weight);
+                   : complex<double>(dft[omega.size() * (chunk_idx++) + num_freq]) / stored_weight);
if (include_dV_and_interp_weights) dft_val /= (sqrt_dV_and_interp_weights ? sqrt(w) : w);

complex<double> mode1val = 0.0, mode2val = 0.0;
4 changes: 2 additions & 2 deletions src/meep.hpp
@@ -1117,7 +1117,7 @@ class dft_chunk {
component c; // component to DFT (possibly transformed by symmetry)

size_t N; // number of spatial points (on epsilon grid)
-  std::complex<double> *dft; // N x Nomega array of DFT values.
+  std::complex<realnum> *dft; // N x Nomega array of DFT values.

class dft_chunk *next_in_chunk; // per-fields_chunk list of DFT chunks
class dft_chunk *next_in_dft; // next for this particular DFT vol./component
@@ -1168,7 +1168,7 @@ class dft_chunk {
int sn;

// cache of exp(iwt) * scale, of length Nomega
-  std::complex<double> *dft_phase;
+  std::complex<realnum> *dft_phase;

ptrdiff_t avg1, avg2; // index offsets for average to get epsilon grid

4 changes: 3 additions & 1 deletion src/stress.cpp
@@ -93,7 +93,9 @@ static void stress_sum(size_t Nfreq, double *F, const dft_chunk *F1, const dft_c
complex<double> extra_weight(real(curF1->extra_weight), imag(curF1->extra_weight));
for (size_t k = 0; k < curF1->N; ++k)
for (size_t i = 0; i < Nfreq; ++i)
-        F[i] += real(extra_weight * curF1->dft[k * Nfreq + i] * conj(curF2->dft[k * Nfreq + i]));
+        F[i] += real(extra_weight *
+                     complex<double>(curF1->dft[k * Nfreq + i]) *
+                     conj(complex<double>(curF2->dft[k * Nfreq + i])));
}
}

