Running the sweep miniapp
-------------------------
Usage
-----
./sweep [ --<setting_name> <setting_value> ] ...
aprun [ <aprun_arg> ] ... sweep [ --<setting_name> <setting_value> ] ...
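For example, a minimal single-process run using small, illustrative values
for the settings documented below (not a recommended configuration) might
look like:
./sweep --ncell_x 16 --ncell_y 16 --ncell_z 16 --ne 16 --na 32 --niterations 1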
Settings
--------
--ncell_x
The global number of gridcells along the X dimension.
--ncell_y
The global number of gridcells along the Y dimension.
--ncell_z
The global number of gridcells along the Z dimension.
--ne
The total number of energy groups. For realistic simulations this may
range from 1 (small) to 44 (normal) up to hundreds (not typical).
--na
The number of angles for each octant direction. For realistic simulations
a typical value would be 32 or more.
NOTE: the number of moments is specified as a compile-time value, NM.
Typical values are 1, 4, 16 and 36.
NOTE: for CUDA builds, the angle and moment axes are always fully threaded.
--niterations
The number of sweep iterations to perform. A setting of 1 iteration
is sufficient to demonstrate the performance characteristics of the
code.
--nproc_x
Available for MPI builds. The number of MPI ranks used to decompose
along the X dimension.
--nproc_y
Available for MPI builds. The number of MPI ranks used to decompose
along the Y dimension.
--nblock_z
The number of sweep blocks used to tile the Z dimension. Currently must
divide ncell_z exactly. Blocks along the Z dimension are kept on
the same MPI rank.
The sweep is a wavefront algorithm; every block is treated as a node
of the wavefront grid for the wavefront calculation.
NOTE: when nthread_octant==8, setting nblock_z such that nblock_z % 2 == 0
can considerably increase performance.
--is_using_device
Available for CUDA builds. Set to 1 to use the GPU, 0 for
CPU-only (default).
--is_face_comm_async
For MPI builds, 1 to use asynchronous communication (default),
0 for synchronous only.
--nthread_octant
For OpenMP or CUDA builds, the number of threads deployed to octants.
The total number of threads equals the product of all thread counts
along problem axes.
Can be 1, 2, 4 or 8. It should be set to 8 for OpenMP or CUDA builds
and left at 1 (the default) otherwise; see the illustrative command at
the end of this settings list.
Octant threading currently uses a semiblock tiling method, which differs
from the production code.
--nsemiblock
An experimental tuning parameter. By default equals nthread_octant.
--nthread_e
For OpenMP or CUDA builds, the number of threads deployed to energy groups
(default 1).
The total number of threads equals the product of all thread counts
along problem axes.
--nthread_y
For OpenMP or CUDA builds, the number of threads deployed to the Y axis
within a sweep block (default 1).
The total number of threads equals the product of all thread counts
along problem axes.
For CUDA builds, can be set to a small integer between 1 and 4.
Not advised for OpenMP builds, as the parallelism is generally too
fine-grained to give good performance.
--nthread_z
For OpenMP or CUDA builds, the number of threads deployed to the Z axis
within a sweep block (default 1).
The total number of threads equals the product of all thread counts
along problem axes.
Since the sweep block thickness in Z (ncell_z/nblock_z) commonly equals 1,
this setting should generally be set to 1.
Not advised for OpenMP builds, as the parallelism is generally too
fine-grained to give good performance.
--ncell_x_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
X dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
--ncell_y_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
Y dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
--ncell_z_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
Z dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
Since the sweep block thickness in Z (ncell_z/nblock_z) commonly equals 1,
this setting should generally be set to 1.
Example 1
---------
Usage example on the ORNL Titan system, CPU only, for executing a weak
scaling study:
(NOTE: the example shown below has recently changed.)
qsub -I -Astf006 -lnodes=1 -lwalltime=2:0:0
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build
cd build
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
env BUILD=Release NM_VALUE=16 ../minisweep/scripts/cmake_cray_xk7.sh
make
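# Weak scaling: each MPI rank keeps a fixed 4 x 8 x 64 block of cells,
# so the global grid grows with the process count.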
for nproc_x in {1..2} ; do
for nproc_y in {1..2} ; do
aprun -n$(( $nproc_x * $nproc_y )) ./sweep \
--ncell_x $(( 4 * $nproc_x )) --ncell_y $(( 8 * $nproc_y )) \
--ncell_z 64 \
--ne 16 --na 32 --nproc_x $nproc_x --nproc_y $nproc_y --nblock_z 64
done
done
Example 2
---------
Usage example on the ORNL Titan system, using GPUs, for executing a weak
scaling study:
(NOTE: the example shown below has recently changed.)
qsub -I -Astf006 -lnodes=4 -lwalltime=2:0:0 -lfeature=gpudefault
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build_cuda
cd build_cuda
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
module load cudatoolkit
env BUILD=Release NM_VALUE=16 ../minisweep/scripts/cmake_cray_xk7_cuda.sh
make
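# Weak scaling across GPU nodes: one rank per node (-N1), each with a
# fixed 16 x 32 x 64 block of cells.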
for nnode_x in {1..2} ; do
for nnode_y in {1..2} ; do
aprun -n$(( $nnode_x * $nnode_y )) -N1 ./sweep \
--ncell_x $(( 16 * $nnode_x )) --ncell_y $(( 32 * $nnode_y )) \
--ncell_z 64 \
--ne 16 --na 32 --nproc_x $nnode_x --nproc_y $nnode_y --nblock_z 64 \
--is_using_device 1 --nthread_octant 8 --nthread_e 16
done
done
Example 3
---------
Usage example on the ORNL Titan system, CPU only, for executing a strong
scaling study across the cores of a node using OpenMP:
qsub -I -Astf006 -lnodes=1 -lwalltime=2:0:0
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build_openmp
cd build_openmp
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
env BUILD=Release ../minisweep/scripts/cmake_cray_xk7_openmp.sh
make
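# Strong scaling: the problem size is fixed while the number of OpenMP
# threads over energy groups (-d / --nthread_e) varies from 1 to 8.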
for nthread_e in {1..8} ; do
aprun -n1 -d$nthread_e ./sweep \
--ncell_x 8 --ncell_y 16 --ncell_z 32 \
--ne 64 --na 32 --nproc_x 1 --nproc_y 1 --nblock_z 32 --nthread_e $nthread_e
done
Example 4
---------
Usage example on the ORNL Crest system (IBM Power8, NVIDIA K40m):
bsub -Is -n 1 -T 1 -W 100 -q interactive bash
mkdir minisweep_work
cd minisweep_work
git clone https://github.com/wdj/minisweep.git
mkdir build_cuda
cd build_cuda
module load cmake3
module load cuda
env BUILD=RELEASE NM_VALUE=16 ../minisweep/scripts/cmake_cuda.sh
make
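# Single-rank GPU run launched with poe.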
poe sweep --ncell_x 16 --ncell_y 32 --ncell_z 64 \
--ne 16 --na 32 --nblock_z 64 \
--is_using_device 1 --nthread_octant 8 --nthread_e 16
References
----------
Christopher G. Baker, Gregory G. Davidson, Thomas M. Evans, Steven P.
Hamilton, Joshua J. Jarrell, and Wayne Joubert, "High Performance
Radiation Transport Simulations: Preparing for TITAN," in Proceedings
of Supercomputing Conference SC12,
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6468508.
Wayne Joubert, "Porting the Denovo Radiation Transport Code to Titan:
Lessons Learned," OLCF Titan Workshop 2012,
https://www.olcf.ornl.gov/wp-content/uploads/2012/01/TitanWorkshop2012_Day3_Joubert.pdf.
T. M. Evans, W. Joubert, S. P. Hamilton, S. R. Johnson, J. A. Turner,
G. G. Davidson, and T. M. Pandya. 2015. "Three-Dimensional Discrete
Ordinates Reactor Assembly Calculations on GPUs," in ANS MC2015 Joint
International Conference on Mathematics and Computation (M&C),
Supercomputing in Nuclear Applications (SNA) and the Monte Carlo (MC)
Method, Nashville, TN, American Nuclear Society, LaGrange Park, 2015,
https://www.ornl.gov/content/three-dimensional-discrete-ordinates-reactor-assembly-calculations-gpus,
http://www.casl.gov/docs/CASL-U-2015-0172-000.pdf.
O.E. Bronson Messer, Ed D'Azevedo, Judy Hill, Wayne Joubert, Mark
Berrill, and Christopher Zimmer, "MiniApps derived from production HPC
applications using multiple programing models," The International
Journal of High Performance Computing Applications, 2016,
http://journals.sagepub.com/doi/abs/10.1177/1094342016668241.