
Building and running with Trilinos


Vortex

You can learn more about using the machine by running less /opt/VORTEX_INTRO after logging in.

Building

  1. To grab the current selection of modules/Trilinos (with RDC required):
source /projects/empire/installs/vortex/CUDA-10.1.243_GNU-7.3.1_SPMPI-ROLLING-RELEASE-CUDA-STATIC/trilinos/latest/load_matching_env.sh
  2. This is the build script I use for basic builds (an example invocation follows the script):
#!/usr/bin/env bash

set -ex  # exit on the first error and echo each command

empire=$1

if test $# -eq 0
then
    echo "usage: $0 <empire-dir> [ <trace-enabled=0> ] [ <build-type=Release> ] "
    exit 1
fi


if test $# -gt 1
then
    trace=$2
else
    trace=0
fi

if test $# -gt 2
then
    build_type=$3
else
    build_type=Release
fi

cmake -GNinja -DCMAKE_EXPORT_COMPILE_COMMANDS=true -DEMPIRE_ENABLE_WERROR=OFF -DEMPIRE_ENABLE_PIC=ON -Dvt_trace_enabled=${trace} -DCMAKE_BUILD_TYPE=${build_type} ${empire}
ninja EMPIRE_PIC.exe
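For example, assuming the script above is saved as build_empire.sh (a name chosen here purely for illustration) and run from your build directory, typical invocations would be:

./build_empire.sh ~/src/empire           # Release build, no VT tracing (path is a placeholder)
./build_empire.sh ~/src/empire 1         # Release build with VT tracing enabled
./build_empire.sh ~/src/empire 0 Debug   # Debug build, no VT tracing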

While not required, the sysadmins have recommended that we build on a compute node. To do so, run:

lalloc 1

That will drop you onto a single compute node, where you can call make or anything else directly.
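A typical compute-node build session might look like the following sketch (the build directory is a placeholder):

lalloc 1                  # interactive shell on one compute node
cd <build-dir>
ninja EMPIRE_PIC.exe      # or run the build script above
exit                      # release the node when done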

Running

To run an interactive job on Vortex with a proper shell run:

bsub -nnodes 16 -Is bash

You will still need to use jsrun from an interactive allocation, as anything run without it will run on the node that launches all jobs.

My primary jsrun line often looks like:

jsrun -M -gpu -p <num_procs> --rs_per_host=<num_procs_per_node> --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed <script>
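For example, on 16 nodes, assuming 4 GPUs per node (so 4 resource sets per node, each with one GPU and 10 cores, and 64 ranks total), that line could be filled in as below; the counts are illustrative only:

jsrun -M -gpu -p 64 --rs_per_host=4 --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed <script>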

Some MPI performance problems can crop up if you do not turn off CUDA-aware MPI at least in Tpetra. To turn it off in Tpetra, do:

export TPETRA_ASSUME_CUDA_AWARE_MPI=0

You can also turn it off at the MPI level by omitting -M -gpu from the jsrun line, but do not omit -M -gpu without also exporting the variable described above.
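In other words, keep the two settings consistent; a sketch of the non-CUDA-aware combination:

# -M -gpu omitted, so Tpetra must not assume CUDA-aware MPI
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
jsrun -p <num_procs> --rs_per_host=<num_procs_per_node> --gpu_per_rs=1 --cpu_per_rs=10 -b rs --latency_priority=gpu-gpu -d packed <script>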

Scheduling

The scheduler is [IBM LSF](https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_users_guide/chap_jobs_lsf.html).

To schedule a batch job:

bsub -N -nnodes 16 -W <time_limit> -C 1000000000 -o <stdout_file> -e <stderr_file> <run_script>
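A filled-in example (the time limit, file names, and script name here are placeholders):

bsub -N -nnodes 16 -W 4:00 -C 1000000000 -o empire.%J.out -e empire.%J.err ./run_empire.sh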

The output and error files will only appear after the job has terminated. If you want to know what's happening sooner:

bpeek <job_id>

To see your jobs (both running and pending) summarized, I recommend:

bjobs -o "user: stat: jobid: job_name:25 submit_time: start_time: run_time: time_left: estimated_start_time:"

To see all jobs, add -u all to the end. If you want to know how many nodes the running jobs occupy, it's best to just use:

bjobs -u all
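Since the formatted bjobs command above is long, one option is to wrap it in a shell alias (the alias name is arbitrary):

alias myjobs='bjobs -o "user: stat: jobid: job_name:25 submit_time: start_time: run_time: time_left: estimated_start_time:"'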

If you schedule multiple jobs and decide not to run them in the order they were submitted, you can move a specific job to the top of your list using:

btop <job_id>

To kill a job, running or pending:

bkill <job_id>

To put a job on hold or release it:

bstop <job_id>
bresume <job_id>

Mutrino

You can learn more about using the machine by running less /opt/MUTRINO_INTRO after logging in.

Building

  1. To grab the current selection of modules/Trilinos (with RDC required):
module swap intel/19.0.4 intel/18.0.5
module unload cray-libsci/19.02.1
source /projects/empire/installs/mutrino/INTEL-18.0.5_MPICH-7.7.6-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh
module unload cmake/3.9.0
module load cmake/3.14.6
  2. This is the build script I use for basic builds:
#!/usr/bin/env bash

set -ex  # exit on the first error and echo each command

empire=$1

if test $# -eq 0
then
    echo "usage: $0 <empire-dir> [ <trace-enabled=0> ] [ <build-type=Release> ] "
    exit 1
fi


if test $# -gt 1
then
    trace=$2
else
    trace=0
fi

if test $# -gt 2
then
    build_type=$3
else
    build_type=Release
fi

srun cmake -DUSE_STANDARD_LINKER=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=true -DEMPIRE_ENABLE_PIC=ON -Dvt_trace_enabled=${trace} -DCMAKE_BUILD_TYPE=${build_type} ${empire}
srun make -j32 EMPIRE_PIC.exe

Note that the srun before make builds on a compute node, which has the added benefit of letting you schedule execution to start as soon as the build job successfully completes:

sbatch -d afterok:<make_job_id> <run_script>
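If you would rather script the whole chain than copy job IDs by hand, one approach (a sketch only, assuming the build script above is saved as build_empire.sh and submitted as its own batch job) is:

# submit the build as a batch job and capture its job ID
make_job=$(sbatch --parsable -C haswell -N 1 -t 2:00:00 build_empire.sh <empire-dir>)
# queue the run, held until the build job completes successfully
sbatch -d afterok:${make_job} <run_script>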

The run job will stay stuck in the queue if your make command fails; in that case, point it at a new build job by changing the dependency:

scontrol update Dependency=afterok:<new_make_job_id> <run_job_id>

or remove the dependency manually when it's finally built:

scontrol update Dependency=   <run_job_id>

Note the space between the equal sign and the next argument.

If you want to build on the head node instead, remove srun from before the make command, but not from the cmake command.

Running

To run an interactive job:

salloc -C haswell -N 64 -t <time_limit> /bin/bash
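Inside that allocation you still launch with srun; for example (the rank count and arguments are placeholders):

srun -n <num_ranks> ./EMPIRE_PIC.exe <input_args>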

Scheduling

To submit a batch job, add this (with appropriate modifications) to the top of your script:

#!/bin/bash

#SBATCH -C haswell
#SBATCH --time=8:00:00
#SBATCH --nodes=64

Then run:

sbatch <script>
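For reference, a complete run script might look like the following sketch; the output file names and the srun line are placeholders rather than a verified EMPIRE command line:

#!/bin/bash

#SBATCH -C haswell
#SBATCH --time=8:00:00
#SBATCH --nodes=64
#SBATCH -o empire.%j.out
#SBATCH -e empire.%j.err

srun -n <num_ranks> ./EMPIRE_PIC.exe <input_args>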

See the building section for information about holding a job until a dependency is met.

To kill a job, run:

scancel <job_id>

To hold or release a job:

scontrol hold <job_id>
scontrol release <job_id>

To see your jobs, including when they might run, I recommend:

squeue -u <user_id> -o "%.10i %.9P %.8j %.8u %.2t %.10M %.10l %.6D %S %e %R"
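As with the bjobs format on Vortex, this is easier to reuse as a shell alias (the name is arbitrary):

alias myq='squeue -u $USER -o "%.10i %.9P %.8j %.8u %.2t %.10M %.10l %.6D %S %e %R"'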

To see all jobs, run:

squeue

Skybridge

Skybridge is set up similarly to Mutrino. In the absence of advice to the contrary, follow the instructions for Mutrino.

Building

source /projects/empire/installs/skybridge/INTEL-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh

Stria

Building

source /projects/empire/installs/stria/ARM-20.0_OPENMPI-4.0.2-RELEASE-OPENMP-STATIC/trilinos/latest/load_matching_env.sh