A Julia package for solving acoustic wave propagation in 2D and 3D on multi-xPUs using CPML boundary conditions.
Seismic tomography is the predominant methodology geoscientists use to investigate the structure and properties of the inaccessible Earth’s interior. Inferring properties of the subsurface from seismic data measured on the surface of the Earth, i.e., solving the tomographic inverse problem, requires the simulation of the physics of wave propagation (the forward problem).
The forward problem can be solved numerically by suitable methods for PDEs (boundary value problems). These methods quickly become computationally expensive when a fine discretization of the subsurface is needed and/or when working with 3D models. Because of that, exploiting parallelization and computational resources is crucial to obtain results in a reasonable amount of time, allowing for the solution of the inverse problem.
In particular, GPUs are increasingly employed to speed up finite-difference computations, thanks to the regularity of their computation pattern.
Currently, open-source software packages combining ease of use, good performance and scalability for these computations are scarce. The Julia programming language is a natural candidate to fill this gap. Julia has recently emerged among computational physicists thanks to its combination of ease of use (reminiscent of languages like Python and MATLAB) and speed. Moreover, Julia natively supports several parallelization paradigms (such as shared memory, message passing and GPU computing), and recently developed packages like ParallelStencil.jl and ImplicitGlobalGrid.jl provide a simple, yet powerful, way to implement scalable, efficient, maintainable and hardware-agnostic parallel algorithms.
We have implemented time-domain finite-difference solvers for acoustic wave propagation with CPML boundary conditions on GPUs using Julia. Extensive benchmarks and performance evaluations of the developed parallel code have been conducted to assess the gain in performance and the ability to scale to multiple GPUs.
The main PDE to solve is the classical acoustic wave equation:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p,$$

where $p = p(x, t)$ is the pressure field and $c = c(x)$ is the wave propagation velocity.

If we also consider a source term $s = s(x, t)$, the equation becomes:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p + s.$$

Consider some rectangular area of space $\Omega$ (with boundary $\partial\Omega$) in which we want to solve the equation.

If we choose a free boundary condition (i.e. a homogeneous Dirichlet BDC), we get the following BVP:

$$
\begin{cases}
\dfrac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p + s & \text{in } \Omega, \\
p = 0 & \text{on } \partial\Omega.
\end{cases}
$$
Free boundary conditions have the issue of creating waves reflected from the boundary, hence they are not suitable for the forward solvers of inverse problems. One possible "solution" is to run the simulation on a much bigger model and prescribe a final time short enough that the reflections do not have time to travel back into the region of interest, but this quickly becomes too expensive. A better approach is to use absorbing boundary conditions such as the Convolutional Perfectly Matched Layer (CPML).
The main concept of CPML BDCs is to enhance the standard wave equation inside the boundary layers with additional terms $\psi_i$ and $\xi_i$ (one pair for each dimension $i$) that progressively damp the wave:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \left( \sum_i \frac{\partial^2 p}{\partial x_i^2} + \frac{\partial \psi_i}{\partial x_i} + \xi_i \right) + s,$$

where the sum runs over the spatial dimensions.

These two new fields $\psi_i$ and $\xi_i$ are updated at each time step according to:

$$\psi_i^{n+1} = b_i \, \psi_i^n + a_i \, \frac{\partial p}{\partial x_i}, \qquad
\xi_i^{n+1} = b_i \, \xi_i^n + a_i \left( \frac{\partial^2 p}{\partial x_i^2} + \frac{\partial \psi_i}{\partial x_i} \right),$$

where $a_i$ and $b_i$ are damping coefficients that depend only on the position inside the CPML layers.
The coefficients $a_i$ and $b_i$ are computed as

$$b_i(x) = e^{-\left( \frac{D_i(x)}{K_{max}} + \alpha_i(x) \right) \Delta t}, \qquad
a_i(x) = \frac{D_i(x) \left( b_i(x) - 1 \right)}{K_{max} \left( D_i(x) + K_{max} \, \alpha_i(x) \right)},$$

and

$$D_i(x) = -\frac{(N + 1) \, c \, \log(R)}{2 \, t_i} \, d_i(x)^N, \qquad
\alpha_i(x) = \alpha_{max} \left( 1 - d_i(x) \right),$$

where:
- $t_i$ is the thickness (in meters) of the CPML boundary in dimension $i$,
- $N$ is a power coefficient (we used $N = 2$),
- $f$ is the dominating frequency of the source $s$,
- $d_i(x)$ is the normalized distance (between $0$ and $1$) from $x$ to the interior part $\overline{\Omega}$ of $\Omega$ in dimension $i$.
We picked experimentally determined coefficients from reference studies of CPML BDCs applied to the acoustic wave equation (see [1] and [2]):
- $K_{max} = 1$,
- $\alpha_{max} = \pi f$,
- $R = 0.01, 0.001, 0.0001$ for $5, 10, 20$ extra CPML nodal points, respectively.
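As an illustration of how such damping profiles can be precomputed, here is a minimal Julia sketch. It assumes the standard profile shapes from the CPML literature cited above; the function name `cpml_coefficients` and its exact signature are illustrative, not the package API.

```julia
# Minimal sketch (not the package API): precompute CPML coefficients a_i, b_i
# for one dimension, assuming the standard profiles from the CPML literature.
function cpml_coefficients(npts, dx, dt, vel_max, f; N=2, K_max=1.0, R=0.0001)
    thickness = npts * dx                  # CPML thickness t_i in meters
    d0 = -(N + 1) * vel_max * log(R) / (2 * thickness)
    αmax = π * f
    d = range(0.0, 1.0; length=npts)       # normalized distance d_i(x) into the layer
    D = d0 .* d .^ N                       # damping profile D_i(x)
    α = αmax .* (1 .- d)                   # frequency-shift profile α_i(x)
    b = exp.(-(D ./ K_max .+ α) .* dt)
    a = D .* (b .- 1.0) ./ (K_max .* (D .+ K_max .* α))
    return a, b
end

# e.g. 20 CPML points, 10 m spacing, 1 ms time step, 3000 m/s max velocity, 10 Hz source
a, b = cpml_coefficients(20, 10.0, 0.001, 3000.0, 10.0)
```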
One crucial observation is that each component of the $\psi$ and $\xi$ fields is non-zero only inside the CPML layers of its corresponding dimension, so these fields need to be stored and updated only there, which saves both memory and computation.
The full set of equations with boundary conditions then consists of the enhanced wave equation and the $\psi_i$ and $\xi_i$ update equations inside the CPML layers, together with the standard wave equation in the interior of the domain.
We used the simplest finite-difference scheme: second-order explicit FD schemes in both space and time. We stagger the $\psi_i$ fields on half-grid points in dimension $i$ with respect to the pressure grid, so that their derivatives are evaluated at pressure nodes.
We split each time step computation into two substeps (see the sketch below):

- first we update the $\psi_i$ fields,
- then we update the $\xi_i$ and pressure fields at the same time.

This separation is crucial to ensure that the $\xi_i$ and pressure updates use the already updated $\psi_i$ values on which they depend.
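The following serial 1D sketch illustrates this two-substep structure. It is a simplified illustration under the formulation written above (only the left CPML layer is handled, the staggering of $\psi$ is glossed over, and all names are illustrative), not the actual kernels of the package.

```julia
# Simplified serial 1D sketch of one time step with CPML on the left boundary only.
# `halo` is the number of CPML grid points; a, b are the CPML coefficients there.
function update_step!(pnew, pcur, pold, ψ, ξ, c, a, b, dt, dx, halo)
    nx = length(pcur)
    # substep 1: update the ψ field inside the CPML layer
    for i in 1:halo
        ∂p∂x = (pcur[i+1] - pcur[i]) / dx
        ψ[i] = b[i] * ψ[i] + a[i] * ∂p∂x
    end
    # substep 2: update ξ and the pressure, using the freshly updated ψ
    for i in 2:nx-1
        ∂²p∂x² = (pcur[i+1] - 2 * pcur[i] + pcur[i-1]) / dx^2
        damp = 0.0
        if i <= halo
            ∂ψ∂x = (ψ[i] - ψ[i-1]) / dx
            ξ[i] = b[i] * ξ[i] + a[i] * (∂²p∂x² + ∂ψ∂x)
            damp = ∂ψ∂x + ξ[i]
        end
        pnew[i] = 2 * pcur[i] - pold[i] + c[i]^2 * dt^2 * (∂²p∂x² + damp)
    end
    return pnew
end
```

In the actual solvers, the same two-substep structure is expressed with ParallelStencil.jl kernels so that it can run on both CPUs and GPUs.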
We have run several simulations in 1D, 2D and 3D to perform qualitative and quantitative analyses of CPML boundary conditions in various setups. In this section, we briefly explain the setups used.
Since solutions for constant velocity models with one point source are known (see [5]), we first compared numerical solutions with analytical ones.
We tested two scenarios:
- no CPML layers, with a final time short enough that no reflections appear,
- 20 CPML layers, with a longer final time to test for appropriate damping.
For the source terms, we always use a Ricker wavelet source (second derivative of a Gaussian) to simulate an "impulse" source.
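For reference, a Ricker wavelet with dominating frequency $f$ and time shift $t_0$ can be written as $r(t) = \left(1 - 2\pi^2 f^2 (t - t_0)^2\right) e^{-\pi^2 f^2 (t - t_0)^2}$. A minimal Julia sketch (the function name and the choice of $t_0$ are illustrative):

```julia
# Ricker wavelet: (negative, normalized) second derivative of a Gaussian.
ricker(t, t0, f) = (1 - 2 * (π * f * (t - t0))^2) * exp(-(π * f * (t - t0))^2)

# e.g. sample a 10 Hz wavelet over 1 s with a 1 ms time step
dt = 0.001
src = [ricker(it * dt, 0.1, 10.0) for it in 0:1000]
```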
We recorded the pressure trace at a receiver placed at a fixed distance from the source and compared it to the analytical solution to verify correctness. You can find these tests under the `test` folder for each version.
We have mainly three types of setups:

- constant velocity models with a single source located in the center and CPML layers on all boundaries;
- gradient velocity models (velocity directly proportional to depth) with multiple sources located at the top of the model and NO CPML layers on the top boundary (to simulate free top boundary conditions);
- a complex velocity model (from [3]) with multiple sources located at the top of the model and NO CPML layers on the top boundary.
We use different configurations of CPML layers (0, 5, 10 or 20 layers in each boundary) with different reflection coefficients (these are shown in the animation plots).
We record traces at some selected positions to observe the damping effect of CPML BDCs.
For the 1D simulations, we show a simple constant velocity model with 201 grid points and one source in the center. The animations for different CPML layer thicknesses are shown below:
*Animations: CPML layers = 0, 5, 10, 20.*
We can see that without CPML layers the wave gets reflected at the boundaries. As CPML layers are added, the wave is absorbed at the boundaries: the more layers, the fewer spurious oscillations travel back towards the source.
In these simulations we tested the 2D CPML solver by looking at spurious reflection waves coming from the boundaries for different numbers of CPML layers. We use a threshold to cut off low pressure values.
We show the maximum absolute pressure value in the whole field to see how much of the wave dissipates after hitting the boundaries.
*Animations: CPML layers = 0, 5, 10, 20.*
We can see that, as in the 1D case, adding more layers helps with reducing the amplitude of the reflected waves.
In this simulation we tackle a non-constant, yet still simple, velocity model. The velocity is shown in the plots as a grayscale gradient in the background, with higher values corresponding to darker colors. As usual, we compare different numbers of CPML layers. We use a bigger pressure threshold to cut off more low pressure values and only see the "main wavefronts". We should still not see any reflections caused by velocity differences, since the gradient is smooth; those would require sharper velocity contrasts.
*Animations: CPML layers = 0, 5, 10, 20.*
Here we investigated a complex velocity model from [3] to see how the code behaves with reflections generated by the model itself. Note that we used quite a high pressure threshold in the animations.
*Animations: CPML layers = 0, 20.*
We compare the models with and without CPML layers and see that we get quite good results: the waves are mostly absorbed and the maximum absolute pressure value decreases after all waves reach the boundaries.
For 3D simulations, we mostly wanted to make sure that the code behaved similarly to the 2D code. We plot both a 2D slice of the model at $nz/2$ and a 3D isometric view.
*Animations (2D $nz/2$ slice and 3D isometric view): CPML layers = 0 and 20.*
This complex velocity model has quite a high resolution, so we ran it using 8 GPUs on Piz Daint. The simulation itself was fairly quick (8000 time steps in 30 seconds), but it produced a lot of data that then needed to be post-processed for visualization.
We can see a number of reflections caused by velocity contrasts in the model, but eventually all waves are mostly absorbed.
In this section we show performance evaluations of the 2D and 3D implementations run on both CPUs and GPUs.
We are interested in assessing the effective memory throughput $T_{eff}$ [GB/s], which measures the amount of data that needs to be loaded/stored for the computation to be performed, divided by the time it takes to perform it. It is computed similarly for the 2D and 3D versions by analyzing the code structure: each field that only gets loaded is counted once, and each field that gets loaded and updated is counted twice. Why don't we look at floating point operations per second? Because the computation we have to perform is memory-bound, so using a metric for compute-bound operations would be unfair, since we could never reach peak performance that way.
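As a concrete sketch of how such a metric can be evaluated, here is a small Julia example for a 2D grid. The field counts used as defaults are placeholders (assumptions for illustration), not the actual counts of the solvers.

```julia
# Illustrative computation of the effective memory throughput T_eff [GB/s].
# The default field counts (how many arrays are only read vs. read and written)
# are placeholders; the actual counts depend on the specific solver kernels.
function T_eff(nx, ny, t_it; n_read_only=1, n_read_write=2, T=Float64)
    A_eff = (n_read_only + 2 * n_read_write) * nx * ny * sizeof(T)  # bytes moved per iteration
    return A_eff / t_it / 1e9                                       # GB/s
end

# e.g. a 4096×4096 Float64 grid with a (hypothetical) 5 ms time per iteration
println(T_eff(4096, 4096, 5e-3))
```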
We also perform a weak scaling analysis of the multi-xPU implementations to see how well they scale across multiple nodes.
GPU benchmarks were run on Piz Daint Cray XC50 compute nodes, each equipped with an NVIDIA® Tesla® P100 16GB GPU (peak memory bandwidth measured with a triad benchmark: around 559 GB/s), using Julia version 1.7.2.

CPU benchmarks were run on a MacBook Pro 2017 laptop, equipped with an Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz with 2 physical cores (peak memory bandwidth measured with STREAM: around 22 GB/s), with Intel Turbo Boost disabled while running, using Julia version 1.8.1.
Here we plot the effective memory throughput for 2D and 3D implementations on both a 2-core CPU and a single GPU, as a function of increasing model size (in the number of grid points).
Each measurement (i.e. each point in the plot) was obtained using the `@belapsed` macro of the BenchmarkTools.jl package. This means that each computation is measured multiple times and the minimum of all measurements is taken. We also perform a statistical analysis of the set of runs by computing the 95% confidence interval and checking that this interval lies inside ±5% of the median. This ensures that random noise is not affecting the measurements too much.
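A minimal, self-contained example of this kind of measurement and check with BenchmarkTools.jl (the `step!` function below is a hypothetical stand-in for one solver iteration, not the package API):

```julia
using BenchmarkTools, Statistics

# hypothetical stand-in for one solver iteration: a simple array update
step!(pnew, pcur) = (@. pnew = 2 * pcur; nothing)
pnew, pcur = zeros(1024, 1024), rand(1024, 1024)

t_it = @belapsed step!($pnew, $pcur)      # minimum time over many samples, in seconds

# statistical sanity check: 95% confidence interval within ±5% of the median
trial = @benchmark step!($pnew, $pcur)
times = trial.times                        # sample times in nanoseconds
ci95  = 1.96 * std(times) / sqrt(length(times))
@assert ci95 <= 0.05 * median(times)
```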
We can see that the performance on the GPU improves once the problem size is big enough and then plateaus. On the contrary, big models do not perform as well on the CPU because of its slower memory bandwidth. For the 2D computation we reach around 87% of GPU peak performance, while for 3D we reach around 60%. This is expected, since the GPU memory access pattern and thread scheduling are not as well optimized for 3D computations.
Here we investigate if our code can be used to run bigger models on multiple xPUs in the same amount of time as smaller models on a single xPU. We only measure GPUs on multiple Piz Daint nodes, since the availability of a fast network is essential to get good scaling across multiple nodes.
As baseline we use the best time measured for a single-node computation.
For these measurements we use a different approach: we run a number of time steps, start measuring only after the first few of them (to skip warm-up effects), and report the time taken by the remaining ones. For 2D we run 1000 time steps and skip the first 200; for 3D we run 100 time steps and skip the first 19. The local model size per xPU is kept fixed as the number of xPUs grows, as required for a weak scaling analysis.
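A sketch of this timing approach (function and variable names are illustrative, not the package API):

```julia
# Time `nt` solver steps, skipping the first `nwarmup` ones (e.g. nt=1000, nwarmup=200 in 2D).
function timed_run!(step!, fields, nt, nwarmup)
    for it in 1:nwarmup
        step!(fields, it)          # warm-up steps, not timed
    end
    t0 = time()
    for it in nwarmup+1:nt
        step!(fields, it)          # timed steps
    end
    return time() - t0             # elapsed wall-clock time in seconds
end
```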
We can see that we obtain a very high and stable weak scaling efficiency: around 95.8% for 2D and around 99.5% for 3D, at least for the number of nodes we could measure on.
CAVEAT: benchmarks were performed using the `@hide_communication` macro of ImplicitGlobalGrid with factors `(2,2)` and `(2,2,2)` for the 2D and 3D multi-xPU versions, respectively. This is not currently supported out of the box, because module precompilation breaks when it is used in combination with ParallelStencil. You can still use this feature by importing and initializing ParallelStencil and ImplicitGlobalGrid before including the source file containing the solver you want to use. Then you can uncomment the lines with `@hide_communication` and specify its argument with the `b_width` parameter.
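A minimal sketch of that workaround (the backend choice, the solver file name and the `solve!` call shown here are illustrative assumptions, not the verified package API):

```julia
using ParallelStencil, ImplicitGlobalGrid

# initialize ParallelStencil before including the solver source
@init_parallel_stencil(Threads, Float64, 3)   # or CUDA for a GPU run

# hypothetical solver source file, with the @hide_communication lines uncommented
include("acoustic_3D_multixpu.jl")

# the solver could then be called passing the boundary width used by @hide_communication,
# e.g. solve!(...; b_width=(2, 2, 2))
```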
We have shown that it is possible to efficiently solve the acoustic wave equation with finite differences on GPUs for both 2D and 3D models.
We can scale the model size by distributing the computation onto multiple GPUs (provided enough of them are available; using a large number of GPUs on Piz Daint is also costly).
CPML boundary conditions seem to work well, at least for Ricker wavelet sources with a specific dominating frequency. Comparing our solver against other solvers known to be correct is one of the next steps.
Work in progress: functionality to compute gradients with respect to velocity using the adjoint method and residuals from available data, as well as checkpointing features.
[3] SEG/EAGE Salt and Overthrust Models
[4] Piz Daint
[5] Computational Seismology: A Practical Introduction, Heiner Igel, Oxford University Press 2016
Here we provide some information on the documentation and on how to use the scripts to reproduce our results.
We provide ready-to-use scripts to run the simulations shown in this README. They are located in the `scripts` folder and are called `run_<solver_version_here>.jl`. They run the same setups that we described above (and some more). You can take inspiration from these scripts to run your own models and simulation configurations.
We also provide scripts for running benchmarks, which you can find in the `scripts/benchmarks` folder, together with scripts for plotting the benchmark results and the data collected from the benchmarks we performed.
We also provide submission scripts for Piz Daint in the `submit_daint` folder. Here you can find various scripts for submitting jobs to the Slurm queue, together with some results from our runs and the data used for the weak scaling efficiency plots.
REMEMBER: If you wish to run the multi-xPU implementations, you should use your own `mpiexecjl` (the MPI launcher wrapper provided by MPI.jl) to start Julia with the desired number of MPI processes.