Running the sweep miniapp
-------------------------
Usage
-----
./sweep [ --<setting_name> <setting_value> ] ...
aprun [ <aprun_arg> ] ... sweep [ --<setting_name> <setting_value> ] ...
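For example, a minimal single-process run using small, illustrative values
for the settings documented below (not a recommended configuration) might
look like:
./sweep --ncell_x 16 --ncell_y 16 --ncell_z 16 --ne 16 --na 32 --niterations 1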
Settings
--------
--ncell_x
The global number of gridcells along the X dimension.
--ncell_y
The global number of gridcells along the Y dimension.
--ncell_z
The global number of gridcells along the Z dimension.
--ne
The total number of energy groups. For realistic simulations this may
range from 1 (small) to 44 (normal) up to hundreds (not typical).
--na
The number of angles for each octant direction. For realistic simulations
a typical value would be 32 or more.
NOTE: the number of moments is specified as a compile-time value, NM.
Typical values are 1, 4, 16 and 36.
NOTE: for CUDA builds, the angle and moment axes are always fully threaded.
--niterations
The number of sweep iterations to perform. A setting of 1 iteration
is sufficient to demonstrate the performance characteristics of the
code.
--nproc_x
Available for MPI builds. The number of MPI ranks used to decompose
along the X dimension.
--nproc_y
Available for MPI builds. The number of MPI ranks used to decompose
along the Y dimension.
--nblock_z
The number of sweep blocks used to tile the Z dimension. Currently must
divide ncell_z exactly. Blocks along the Z dimension are kept on
the same MPI rank.
The sweep is a wavefront algorithm; every block is treated as a node
of the wavefront grid for the wavefront calculation.
NOTE: when nthread_octant==8, setting nblock_z such that nblock_z % 2 == 0
can considerably increase performance.
--is_using_device
Available for CUDA builds. Set to 1 to use the GPU, 0 for
CPU-only (default).
--is_face_comm_async
For MPI builds, 1 to use asynchronous communication (default),
0 for synchronous only.
--nthread_octant
For OpenMP or CUDA builds, the number of threads deployed to octants.
The total number of threads equals the product of all thread counts
along problem axes.
Can be 1, 2, 4 or 8. It should be set to 8 for OpenMP or CUDA builds
and left at 1 (the default) otherwise; see the illustrative command at
the end of this settings list.
Octant threading currently uses a semiblock tiling method, which differs
from the production code.
--nsemiblock
An experimental tuning parameter. By default equals nthread_octant.
--nthread_e
For OpenMP or CUDA builds, the number of threads deployed to energy groups
(default 1).
The total number of threads equals the product of all thread counts
along problem axes.
--nthread_y
For OpenMP or CUDA builds, the number of threads deployed to the Y axis
within a sweep block (default 1).
The total number of threads equals the product of all thread counts
along problem axes.
For CUDA builds, can be set to a small integer between 1 and 4.
Not advised for OpenMP builds, as the parallelism is generally too
fine-grained to give good performance.
--nthread_z
For OpenMP or CUDA builds, the number of threads deployed to the Z axis
within a sweep block (default 1).
The total number of threads equals the product of all thread counts
along problem axes.
Since the sweep block thickness in Z (ncell_z/nblock_z) commonly equals 1,
this setting should generally be set to 1.
Not advised for OpenMP builds, as the parallelism is generally too
fine-grained to give good performance.
--ncell_x_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
X dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
--ncell_y_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
Y dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
--ncell_z_per_subblock
For OpenMP or CUDA builds, a blocking factor applied to the sweep block
in order to deploy Y/Z threading. By default equals the number of cells along the
Z dimension for the given MPI rank, or half this amount if the axis
is semiblocked due to octant threading.
Since the sweep block thickness in Z (ncell_z/nblock_z) commonly equals 1,
this setting should generally be set to 1.
Example 1
---------
Usage example on the ORNL Titan system, CPU only, for executing a weak
scaling study:
(NOTE: the example shown below has recently changed.)
qsub -I -Astf006 -lnodes=1 -lwalltime=2:0:0
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build
cd build
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
env BUILD=Release NM_VALUE=16 ../minisweep/scripts/cmake_cray_xk7.sh
make
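# Weak scaling: each MPI rank keeps a fixed 4 x 8 x 64 block of cells,
# so the global grid grows with the process count.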
for nproc_x in {1..2} ; do
for nproc_y in {1..2} ; do
aprun -n$(( $nproc_x * $nproc_y )) ./sweep \
--ncell_x $(( 4 * $nproc_x )) --ncell_y $(( 8 * $nproc_y )) \
--ncell_z 64 \
--ne 16 --na 32 --nproc_x $nproc_x --nproc_y $nproc_y --nblock_z 64
done
done
Example 2
---------
Usage example on the ORNL Titan system, using GPUs, for executing a weak
scaling study:
(NOTE: the example shown below has recently changed.)
qsub -I -Astf006 -lnodes=4 -lwalltime=2:0:0 -lfeature=gpudefault
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build_cuda
cd build_cuda
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
module load cudatoolkit
env BUILD=Release NM_VALUE=16 ../minisweep/scripts/cmake_cray_xk7_cuda.sh
make
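# Weak scaling across GPU nodes: one rank per node (-N1), each with a
# fixed 16 x 32 x 64 block of cells.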
for nnode_x in {1..2} ; do
for nnode_y in {1..2} ; do
aprun -n$(( $nnode_x * $nnode_y )) -N1 ./sweep \
--ncell_x $(( 16 * $nnode_x )) --ncell_y $(( 32 * $nnode_y )) \
--ncell_z 64 \
--ne 16 --na 32 --nproc_x $nnode_x --nproc_y $nnode_y --nblock_z 64 \
--is_using_device 1 --nthread_octant 8 --nthread_e 16
done
done
Example 3
---------
Usage example on the ORNL Titan system, CPU only, for executing a strong
scaling study across the cores of a node using OpenMP:
qsub -I -Astf006 -lnodes=1 -lwalltime=2:0:0
cd $MEMBERWORK/stf006
mkdir minisweep_work
cd minisweep_work
module load git
git clone https://github.com/wdj/minisweep.git
mkdir build_openmp
cd build_openmp
module swap PrgEnv-pgi PrgEnv-gnu
module load cmake
env BUILD=Release ../minisweep/scripts/cmake_cray_xk7_openmp.sh
make
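# Strong scaling: the problem size is fixed while the number of OpenMP
# threads over energy groups (-d / --nthread_e) varies from 1 to 8.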
for nthread_e in {1..8} ; do
aprun -n1 -d$nthread_e ./sweep \
--ncell_x 8 --ncell_y 16 --ncell_z 32 \
--ne 64 --na 32 --nproc_x 1 --nproc_y 1 --nblock_z 32 --nthread_e $nthread_e
done
Example 4
---------
Usage example on the ORNL Crest system (IBM Power8, NVIDIA K40m):
bsub -Is -n 1 -T 1 -W 100 -q interactive bash
mkdir minisweep_work
cd minisweep_work
git clone https://github.com/wdj/minisweep.git
mkdir build_cuda
cd build_cuda
module load cmake3
module load cuda
env BUILD=RELEASE NM_VALUE=16 ../minisweep/scripts/cmake_cuda.sh
make
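# Single-rank GPU run launched with poe.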
poe sweep --ncell_x 16 --ncell_y 32 --ncell_z 64 \
--ne 16 --na 32 --nblock_z 64 \
--is_using_device 1 --nthread_octant 8 --nthread_e 16
References
----------
Christopher G. Baker, Gregory G. Davidson, Thomas M. Evans, Steven P.
Hamilton, Joshua J. Jarrell, and Wayne Joubert, "High Performance
Radiation Transport Simulations: Preparing for TITAN," in Proceedings
of Supercomputing Conference SC12,
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6468508.
Wayne Joubert, "Porting the Denovo Radiation Transport Code to Titan:
Lessons Learned," OLCF Titan Workshop 2012,
https://www.olcf.ornl.gov/wp-content/uploads/2012/01/TitanWorkshop2012_Day3_Joubert.pdf.
T. M. Evans, W. Joubert, S. P. Hamilton, S. R. Johnson, J. A. Turner,
G. G. Davidson, and T. M. Pandya. 2015. "Three-Dimensional Discrete
Ordinates Reactor Assembly Calculations on GPUs," in ANS MC2015 Joint
International Conference on Mathematics and Computation (M&C),
Supercomputing in Nuclear Applications (SNA) and the Monte Carlo (MC)
Method, Nashville, TN, American Nuclear Society, LaGrange Park, 2015,
https://www.ornl.gov/content/three-dimensional-discrete-ordinates-reactor-assembly-calculations-gpus,
http://www.casl.gov/docs/CASL-U-2015-0172-000.pdf.
O.E. Bronson Messer, Ed D'Azevedo, Judy Hill, Wayne Joubert, Mark
Berrill, and Christopher Zimmer, "MiniApps derived from production HPC
applications using multiple programing models," The International
Journal of High Performance Computing Applications, 2016,
http://journals.sagepub.com/doi/abs/10.1177/1094342016668241.