Benchmark AoSoA data layout performance
We use performance microkernels for benchmarking the library and its underlying data structure. We start with a kernel abstracted from a type of particle pushing routine. It mimics what is sometimes called particle substepping or subcycling: pushing each particle under a given field for many timesteps (e.g. a loop for(i=0; i<nstep; i++)). This is one possible scenario in a PIC code, where a large time step is taken for the field advance while smaller timesteps are needed for the particle pushes. This benchmark is a simplified, but closely related, version of particle sub-cycling, with the particle loop written as:
    for( i = 0; i < nstep; i++){
        for( j = 0; j < VECLEN; j++){
            p0->v[j] = a->v[j] * p0->v[j] + c->v[j];
            ....
            p9->v[j] = a->v[j] * p9->v[j] + c->v[j];
        }
    }
where we have used an SoA (structure of arrays) defined as
    struct Particle{
        float v[VECLEN];
        ....
    };
for all quantities in the loop. Note that we have essentially unrolled the particle loop 10 times; this overcomes operation latencies and improves instruction-level parallelism, so that the floating-point units of the CPU always have work to do without waiting for data or instructions.
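For concreteness, a minimal, self-contained sketch of such a kernel is given below. The names push_particles and NUM_UNROLL, the value of VECLEN, and the use of a fixed-trip inner loop in place of the hand-written p0..p9 unroll are illustrative choices, not the exact code used in the benchmark.

    // Minimal sketch of the compute kernel (names and VECLEN are illustrative).
    constexpr int VECLEN = 16;      // assumed inner (SIMD-friendly) vector length
    constexpr int NUM_UNROLL = 10;  // the 10 unrolled particle "slots" p0..p9

    struct Particle {
        float v[VECLEN];
    };

    // Pushes NUM_UNROLL particle slots for nstep steps; each element update is
    // one fused multiply-add. The fixed-trip loop over n stands in for the
    // manual 10-way unroll and is fully unrolled by an optimizing compiler.
    void push_particles(Particle* p, const Particle* a, const Particle* c, int nstep)
    {
        for (int i = 0; i < nstep; ++i) {
            for (int j = 0; j < VECLEN; ++j) {
                for (int n = 0; n < NUM_UNROLL; ++n) {
                    p[n].v[j] = a->v[j] * p[n].v[j] + c->v[j];
                }
            }
        }
    }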
Such a kernel can be used to reach the peak flops of a CPU core, simply by calling the above kernel with a large enough nstep. An example is shown in the figure below.
Figure: Number of flops per clock cycle as a function of nstep. Produced on an Intel(R) Xeon(R) CPU E5-2660 v3 (Haswell), using Intel C++ compiler version 17.0.6, with compiler options -march=core-avx2 -O3.
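One simple way to obtain such a flops-per-cycle number is to time the kernel with the time-stamp counter and divide the flop count by the elapsed cycles. The sketch below reuses the Particle struct and push_particles from the previous sketch; it assumes an invariant TSC and counts two flops per fused multiply-add. The actual benchmark harness may measure this differently.

    // Sketch of a flops/cycle measurement for the kernel above (assumes an
    // invariant TSC so that __rdtsc() ticks at a fixed rate).
    #include <cstdio>
    #include <x86intrin.h>

    int main()
    {
        Particle p[NUM_UNROLL], a, c;
        for (int j = 0; j < VECLEN; ++j) {
            a.v[j] = 0.999f;    // non-trivial coefficients so the FMAs are not
            c.v[j] = 0.001f;    // folded away as constants
            for (int n = 0; n < NUM_UNROLL; ++n) p[n].v[j] = 1.0f;
        }

        const int nstep = 1 << 20;  // large enough to stay compute-bound
        unsigned long long t0 = __rdtsc();
        push_particles(p, &a, &c, nstep);
        unsigned long long cycles = __rdtsc() - t0;

        // One FMA = 2 floating-point operations per element per step.
        double flops = 2.0 * nstep * NUM_UNROLL * VECLEN;
        std::printf("flops/cycle = %.2f (checksum %f)\n",
                    flops / cycles, static_cast<double>(p[0].v[0]));
        return 0;
    }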
We can easily use the same kernel to benchmark memory operations: almost exactly the same kernel becomes memory-bound if we set nstep=1. We can then measure the achieved bandwidth as a function of the number of particles. The particle loop becomes:
    for( k = 0; k < num_partlist; k++){
        for( j = 0; j < VECLEN; j++){
            p0[k]->v[j] = a->v[j] * p0[k]->v[j] + c->v[j];
            ...
            p9[k]->v[j] = a->v[j] * p9[k]->v[j] + c->v[j];
        }
    }
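As a rough illustration of how the achieved bandwidth could be reported for this nstep = 1 variant, the sketch below reuses push_particles from the earlier sketch; the traffic model (one 4-byte load and one 4-byte store per particle element, ignoring prefetch and write-allocate traffic) is an assumption.

    // Sketch of a bytes/cycle measurement for the memory-bound (nstep = 1) case.
    #include <cstddef>
    #include <x86intrin.h>

    double bytes_per_cycle(Particle* parts, std::size_t num_partlist,
                           const Particle* a, const Particle* c)
    {
        unsigned long long t0 = __rdtsc();
        for (std::size_t k = 0; k < num_partlist; ++k) {
            // one pass (nstep = 1) over each group of NUM_UNROLL particles
            push_particles(&parts[k * NUM_UNROLL], a, c, 1);
        }
        unsigned long long cycles = __rdtsc() - t0;

        // Assumed traffic: each particle element is loaded once and stored once.
        double bytes = 2.0 * num_partlist * NUM_UNROLL * VECLEN * sizeof(float);
        return bytes / cycles;
    }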
The next figure shows the achieved bandwidth, measured in bytes per clock cycle, as a function of the work size in kilobytes. Note that here the L1 cache is 32 KB, L2 is 256 KB, and L3 is 25.6 MB. Compared to the theoretical peaks of 96 bytes/cycle in L1 and 32 bytes/cycle in L2, the kernel reaches about 87% and 75% efficiency in the L1 and L2 caches, respectively. It is worth noting that aligned memory allocation (using _mm_malloc()) is important for the L1 cache performance: the performance drops by a factor of two if malloc() or new is used.
Figure: Number of memory bytes per clock cycle as a function of worksize. The worksize is the total number of bytes of all particles. Produced on an Intel(R) Xeon(R) CPU E5-2660 v3 (Haswell), using Intel C++ compiler version 17.0.6, with compiler options -march=core-avx2 -O3.
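The aligned-allocation point can be sketched as follows; the 64-byte (cache-line) alignment is an assumed choice for illustration rather than the benchmark's exact allocation code.

    // Sketch of aligned allocation for the particle storage. _mm_malloc() and
    // _mm_free() come from the x86 intrinsics headers.
    #include <cstddef>
    #include <mm_malloc.h>

    Particle* alloc_particles(std::size_t n)
    {
        // Each Particle starts on a cache-line boundary; plain malloc()/new makes
        // no such guarantee, which is what roughly halves the measured L1 bandwidth.
        return static_cast<Particle*>(_mm_malloc(n * sizeof(Particle), 64));
    }

    void free_particles(Particle* p)
    {
        _mm_free(p);
    }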
It is worthwhile to combine the above two benchmarks to scan the combined flop-memory parameter space. This is particularly useful because we can then look at the kernel in the memory-bound regime, the compute-bound regime, or anywhere in between. The figure below shows a 2D scan of the flop-memory parameters. For low nstep we see some variation with respect to the worksize, corresponding to the cache sizes. As nstep increases, such variations become less and less visible, and the kernel eventually reaches the peak flops independent of the worksize.
Figure: Number of flops per clock cycle as a function of worksize and nstep. Produced on an Intel(R) Xeon(R) CPU E5-2660 v3 (Haswell), using Intel C++ compiler version 17.0.6, with compiler options -march=core-avx2 -O3.
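A possible stand-alone driver for such a 2D scan, reusing the helpers sketched above, is shown below; the grids of nstep and worksize values and the output format are illustrative only.

    // Possible driver for the 2D flop-memory scan, reusing the sketches above.
    #include <cstddef>
    #include <cstdio>
    #include <x86intrin.h>

    int main()
    {
        Particle a, c;
        for (int j = 0; j < VECLEN; ++j) { a.v[j] = 0.999f; c.v[j] = 0.001f; }

        for (int nstep = 1; nstep <= 1024; nstep *= 4) {
            for (std::size_t groups = 1; groups <= (std::size_t(1) << 15); groups *= 8) {
                const std::size_t n = groups * NUM_UNROLL;
                Particle* parts = alloc_particles(n);
                for (std::size_t i = 0; i < n; ++i)
                    for (int j = 0; j < VECLEN; ++j) parts[i].v[j] = 1.0f;

                unsigned long long t0 = __rdtsc();
                for (std::size_t k = 0; k < groups; ++k)
                    push_particles(&parts[k * NUM_UNROLL], &a, &c, nstep);
                unsigned long long cycles = __rdtsc() - t0;

                const double flops = 2.0 * nstep * groups * NUM_UNROLL * VECLEN;
                const double worksize_kb = n * sizeof(Particle) / 1024.0;
                std::printf("nstep %5d  worksize %10.1f KB  flops/cycle %6.2f\n",
                            nstep, worksize_kb, flops / cycles);
                free_particles(parts);
            }
        }
        return 0;
    }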
Performance tests comparing pure C++, Kokkos, and Cabana implementations of the peakflops kernel:
Example: Peakflops Performance Test