# Building and running with Trilinos
## Vortex

You can learn more about using the machine by running `less /opt/VORTEX_INTRO` after logging in.
- To grab the current selection of modules/Trilinos (with RDC required):

  ```sh
  source /projects/empire/installs/vortex/CUDA-10.1.243_GNU-7.3.1_SPMPI-ROLLING-RELEASE-CUDA-STATIC/trilinos/latest/load_matching_env.sh
  ```
- This is the build script I use for basic builds:

  ```sh
  #!/usr/bin/env bash
  set -ex

  # EMPIRE source directory is the only required argument
  empire=$1
  if test $# -eq 0
  then
      echo "usage: $0 <empire-dir> [ <trace-enabled=0> ] [ <build-type=Release> ]"
      exit 1
  fi

  # optional second argument: enable VT tracing (default off)
  if test $# -gt 1
  then
      trace=$2
  else
      trace=0
  fi

  # optional third argument: CMake build type (default Release)
  if test $# -gt 2
  then
      build_type=$3
  else
      build_type=Release
  fi

  cmake -GNinja -DCMAKE_EXPORT_COMPILE_COMMANDS=true -DEMPIRE_ENABLE_WERROR=OFF -DEMPIRE_ENABLE_PIC=ON -Dvt_trace_enabled=${trace} -DCMAKE_BUILD_TYPE=${build_type} ${empire}
  ninja EMPIRE_PIC.exe
  ```
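Assuming the script above is saved as `build_empire.sh` (the script name and checkout path are placeholders), a typical invocation from an empty build directory might look like:

```sh
# build from the EMPIRE checkout in ~/EMPIRE, tracing off, Release build
./build_empire.sh ~/EMPIRE 0 Release
```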
While not required, the sysadmins have recommended that we build on a compute node. To do so, run:

```sh
lalloc 1
```

That will drop you directly onto a single compute node, where you can directly call `make` or anything else.
To run an interactive job on Vortex with a proper shell, run:

```sh
bsub -nnodes 16 -Is bash
```
You will still need to use `jsrun` from an interactive allocation, as anything run without it will run on the node that launches all jobs. My primary `jsrun` line often looks like:
```sh
jsrun -M -gpu -p <num_procs> --rs_per_host=<num_procs_per_node> --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed <script>
```
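For a concrete example: assuming four GPUs per node, a 16-rank run across four nodes (the run script name is a placeholder) might look like:

```sh
# 16 ranks total, 4 resource sets per node, each with 1 GPU and 10 cores
jsrun -M -gpu -p 16 --rs_per_host=4 --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed ./run_empire.sh
```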
Some MPI performance problems can crop up if you do not turn off CUDA-aware MPI at least in Tpetra. To turn it off in Tpetra, do:

```sh
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
```
You can also turn it off in MPI by omitting `-M -gpu`. Do not omit `-M -gpu` without exporting the variable as described above.
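Putting those rules together, the two safe combinations look something like this (rank counts and the script name are placeholders):

```sh
# Option 1: keep CUDA-aware MPI (-M -gpu) but tell Tpetra not to use it
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
jsrun -M -gpu -p 16 --rs_per_host=4 --gpu_per_rs=1 --cpu_per_rs=10 -b rs -d packed ./run_empire.sh

# Option 2: no CUDA-aware MPI at all; the export is then mandatory
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
jsrun -p 16 --rs_per_host=4 --gpu_per_rs=1 --cpu_per_rs=10 -b rs -d packed ./run_empire.sh
```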
The scheduler is [IBM LSF](https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_users_guide/chap_jobs_lsf.html).
To schedule a batch job:

```sh
bsub -N -nnodes 16 -W <time_limit> -C 1000000000 -o <stdout_file> -e <stderr_file> <run_script>
```
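For reference, a minimal `<run_script>` consistent with the 16-node allocation above might look like the sketch below; the launch line and input are assumptions, not the canonical EMPIRE invocation:

```sh
#!/usr/bin/env bash
# one rank per GPU: 16 nodes x 4 GPUs each (assuming 4 GPUs per node)
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
jsrun -M -gpu -p 64 --rs_per_host=4 --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed ./EMPIRE_PIC.exe <input_deck>
```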
The output and error files will only appear after the job has terminated. If you want to know what's happening sooner:

```sh
bpeek <job_id>
```
To see your jobs (both running and pending) summarized, I recommend:

```sh
bjobs -o "user: stat: jobid: job_name:25 submit_time: start_time: run_time: time_left: estimated_start_time:"
```
To see all jobs, add `-u all` to the end. If you want to know how wide the running jobs are, it's best to just use:

```sh
bjobs -u all
```
If you schedule multiple jobs and decide not to run them in the order they were submitted, you can move a specific job to the top of your list using:

```sh
btop <job_id>
```
To kill a job, running or pending:

```sh
bkill <job_id>
```
To put a job on hold or release it:

```sh
bstop <job_id>
bresume <job_id>
```
## Mutrino

You can learn more about using the machine by running `less /opt/MUTRINO_INTRO` after logging in.
- To grab the current selection of modules/Trilinos (with RDC required):

  ```sh
  module swap intel/19.0.4 intel/18.0.5
  module unload cray-libsci/19.02.1
  source /projects/empire/installs/mutrino/INTEL-18.0.5_MPICH-7.7.6-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh
  module unload cmake/3.9.0
  module load cmake/3.14.6
  ```
- This is the build script I use for basic builds:

  ```sh
  #!/usr/bin/env bash
  set -ex

  # EMPIRE source directory is the only required argument
  empire=$1
  if test $# -eq 0
  then
      echo "usage: $0 <empire-dir> [ <trace-enabled=0> ] [ <build-type=Release> ]"
      exit 1
  fi

  # optional second argument: enable VT tracing (default off)
  if test $# -gt 1
  then
      trace=$2
  else
      trace=0
  fi

  # optional third argument: CMake build type (default Release)
  if test $# -gt 2
  then
      build_type=$3
  else
      build_type=Release
  fi

  srun cmake -DUSE_STANDARD_LINKER=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=true -DEMPIRE_ENABLE_PIC=ON -Dvt_trace_enabled=${trace} -DCMAKE_BUILD_TYPE=${build_type} ${empire}
  srun make -j32 EMPIRE_PIC.exe
  ```
Note that the `srun` before `make` will build on a compute node, which has the benefit of allowing you to schedule execution as soon as the build job successfully completes:

```sh
sbatch -d afterok:<make_job_id> <run_script>
```
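If you want to capture the build job's id for that dependency automatically, one option is a sketch like the following, assuming you submit the build script itself with `sbatch` (the script names are placeholders):

```sh
# --parsable makes sbatch print just the job id
make_job_id=$(sbatch --parsable build_empire.sh ~/EMPIRE)
# the run script starts only if the build job exits successfully
sbatch -d afterok:${make_job_id} run_empire.sh
```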
The job will get stuck in the queue if your `make` command fails, so change the dependency using:

```sh
scontrol update JobId=<run_job_id> Dependency=afterok:<new_make_job_id>
```
or remove the dependency manually when it's finally built:

```sh
scontrol update JobId=<run_job_id> Dependency=
```

Note that nothing follows the equals sign.
If you want to build on the head node instead, remove `srun` from before the `make` command, but not from the `cmake` command.
To run an interactive job:

```sh
salloc -C haswell -N 64 -t <time_limit> /bin/bash
```
To submit a batch job, add this (with appropriate modifications) to the top of your script:

```sh
#!/bin/bash
#SBATCH -C haswell
#SBATCH --time=8:00:00
#SBATCH --nodes=64
```
Then run:

```sh
sbatch <script>
```
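Putting the pieces together, a complete batch script might look like the sketch below; the `srun` line, ranks-per-node count, and input are assumptions to adapt:

```sh
#!/bin/bash
#SBATCH -C haswell
#SBATCH --time=8:00:00
#SBATCH --nodes=64

# hypothetical launch: 32 ranks per Haswell node across 64 nodes
srun -n $((64 * 32)) ./EMPIRE_PIC.exe <input_deck>
```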
See the building section for information about holding a job until a dependency is met.
To kill a job, run:

```sh
scancel <job_id>
```
To hold or release a job:

```sh
scontrol hold <job_id>
scontrol release <job_id>
```
To see your jobs, including when they might run, I recommend:

```sh
squeue -u <user_id> -o "%.10i %.9P %.8j %.8u %.2t %.10M %.10l %.6D %S %e %R"
```
To see all jobs, run:

```sh
squeue
```
## Skybridge

Skybridge is set up similarly to Mutrino. In the absence of advice to the contrary, follow the instructions for Mutrino.

To grab the current selection of modules/Trilinos:

```sh
source /projects/empire/installs/skybridge/INTEL-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh
```
## Stria

To grab the current selection of modules/Trilinos:

```sh
source /projects/empire/installs/stria/ARM-20.0_OPENMPI-4.0.2-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh
```