| title | author |
| --- | --- |
| Exercise 05 - Batch computing - running a script on LOTUS | Ag Stephens |
Having established (in exercise 4) that I can extract the total cloud cover (`TCC`) variable from a single ERA-Interim file, I now wish to extract that data for an entire month. I will write some simple scripts to batch up separate processes that run CDO to extract the `TCC` variable from a series of ERA-Interim files. Each run of the script will loop through 4 x 6-hourly files for one day. I will run it 30 times, once for each day in September 2018. Each run will be submitted to the LOTUS cluster.
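As a reminder, the per-file extraction from exercise 4 comes down to a single CDO call of roughly this form (the file names here are placeholders, and `selname` is just one of the CDO operators that can select a variable by name):

```bash
# Extract the total cloud cover (TCC) variable from one 6-hourly ERA-Interim file
cdo selname,TCC input-era-interim.nc output-tcc.nc
```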
After completing this exercise I will be able to:
- write scripts to batch up tasks
- submit scripts to the LOTUS cluster
- JASMIN account with SSH public key uploaded and `jasmin-login` privilege
- login servers: `login2.jasmin.ac.uk`
- sci servers: `sci[1-8].jasmin.ac.uk`
- LOTUS batch processing cluster
- common software: CDO (Climate Data Operators) tool
- GWS (read/write): `/gws/pw/j07/workshop`
- CEDA Archive (read-only): requires a CEDA account
- help documentation at https://help.jasmin.ac.uk
- SSH client (to login to JASMIN)
You can follow this exercise by watching the videos below, or by following the text of this article, or a combination of both.
This is the outline of what you need to do. The recommended way of doing each step is covered in the "Cheat Sheet" but you may wish to try solving it for yourself first.
- Your starting point is on a JASMIN `login` server (see exercise 01)
- SSH to a scientific analysis server
- Write an "`extract-era-data.sh`" wrapper script that calls the CDO extraction command
- Write a script, called "`submit-all.sh`", to loop over dates from 01/09/2018 to 02/09/2018 and submit the "`extract-era-data.sh`" script to LOTUS for each day
- Run the "`submit-all.sh`" script
- Examine which jobs are in the queue
- Examine the standard output and standard error files
- Modify "`submit-all.sh`" so that it will run for all 30 days in September 2018
- Re-run the "`submit-all.sh`" script
- Examine which jobs are in the queue
- Kill one of the jobs - just to see how it is done
All too easy? Here are some questions to test your knowledge and understanding. You might find the answers by exploring the JASMIN Documentation.
- You have learnt about some basic commands to interact with the SLURM scheduler (such as `sbatch` and `squeue`). This manages the submission and execution of jobs via the LOTUS queues. Which other commands might be useful when interacting with the scheduler?
- Which queues are available on LOTUS? What is the difference between them? Why would you choose one over another?
- How can you instruct SLURM to allocate CPUs and memory to specific jobs when you run them? Can you change the allocations when the job is queuing?
- How can you cancel all your jobs in the SLURM queue?
This exercise demonstrates how to:
- Create a script that takes an argument to process a single component (day) of an overall task.
- Create a wrapper script that loops through all the components that need to be processed.
- Submit each component as a LOTUS job using the `sbatch` command.
- Define the command-line arguments for the `sbatch` command.
- Use other SLURM commands, such as `squeue` (to monitor progress) and `scancel` (to cancel jobs).
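In day-to-day use, that boils down to a short cycle of commands like the following (generic examples; the job script name and job ID are placeholders):

```bash
sbatch my-job-script.sh    # submit a batch script to the scheduler
squeue -u $USER            # list your pending and running jobs
scancel 1234567            # cancel a single job by its job ID
```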
Alternative approaches could include:

- Write the output to a `scratch` directory:
  - There are two main scenarios in which you might write the output to a scratch directory:
    - You only need to store the output file for temporary use (such as intermediate files in your workflow).
    - You want to write outputs to scratch before moving them to a GWS.
  - The Help page (https://help.jasmin.ac.uk/article/176-storage#diskmount) tells us that there are two types of scratch space:
    - `/work/scratch-pw2` – supports parallel writes
    - `/work/scratch-nopw2` – does NOT support parallel writes
  - Since we do not need parallel write capability, we can use the "`nopw`" version.
  - You need to set up a directory under "`/work/scratch-nopw2`" named after your username:

        MYSCRATCH=/work/scratch-nopw2/$USER
        mkdir -p $MYSCRATCH

  - Then you would write output files/directories under your scratch space, e.g.:

        OUTPUT_FILE=$MYSCRATCH/output.nc
        ...some_process... > $OUTPUT_FILE

  - When you have finished with the file, tidy up (good practice):

        rm $OUTPUT_FILE

  - Do not leave data on the "scratch" areas when you have finished your workflow:
    1. Please remove any temporary files/directories that you have created.
    2. You cannot rely on the data persisting in the "scratch" areas.
- Specify the memory requirements of your job:
  - If your job has a significant memory footprint:
    - Run a single iteration on LOTUS and review the standard output file to examine the memory usage.
    - You can then reserve a memory allocation when you submit your subsequent jobs (see the sketch after this list).
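As a hedged illustration of the memory point above, SLURM's accounting and submission options can be used roughly as follows (the job ID and memory value are placeholders, and `sacct` availability depends on the site configuration):

```bash
# After one trial run, check how much memory the job actually used (MaxRSS column)
sacct -j 1234567 --format=JobID,JobName,Elapsed,MaxRSS

# Then reserve a suitable amount (with some headroom) for the remaining jobs
sbatch --mem=4G -p short-serial -t 00:05:00 extract-era-data.sh 20180902
```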
This demonstrates best practice:
- Build up in stages before running your full workflow on LOTUS (a short command sketch follows this list):
  - Check your code - is it really doing what you think it is doing?
  - Run locally (on a `sci` server) for one iteration.
  - Run for one or two iterations on LOTUS.
  - Check everything ran correctly on LOTUS.
  - Submit your full batch of jobs to LOTUS.
- Have any files been accidentally left on the system? (E.g. in `/tmp/`):
  - It is important to clean up any temporary files that you no longer need.
  - Please check whether the tools you use have left any files in "`/tmp/`".
- Your starting point is on a JASMIN `login` server (see exercise 01)
- SSH to a scientific analysis server:

      ssh sci5  # Could use any of sci[1-8]

- Write an "`extract-era-data.sh`" wrapper script that calls the CDO extraction command, which:
  - Takes a date string ("`YYYYMMDD`") as a command-line argument
  - Locates the 4 x 6-hourly input file paths for the date provided
  - Activates an environment containing the CDO tool
  - For each 6-hourly file:
    - Defines the output file path
    - Runs the CDO tool to extract the "TCC" variable from the input file to the output file
  - If you are stuck, you can use the script located at:
    `/gws/pw/j07/workshop/exercises/ex05/code/extract-era-data.sh`
    [ Source: https://github.com/cedadev/jasmin-workshop/blob/master/exercises/ex05/code/extract-era-data.sh ]
  - A rough sketch of this script (and of "`submit-all.sh`") is also given after this cheat sheet.
- Write a script, called "`submit-all.sh`", to loop over dates from 01/09/2018 to 02/09/2018 and submit the "`extract-era-data.sh`" script to LOTUS for each day:
  - You should define the following LOTUS directives:
    - Standard output file - please ensure this is unique to each job by including the "`%j`" variable in the file name.
    - Standard error file - please ensure this is unique to each job by including the "`%j`" variable in the file name.
  - Queue name:
    - We will use the main queue for quick serial jobs: `short-serial`
    - NOTE: if working with a training account, you might need `--account=workshop --partition=workshop` in your arguments.
  - Job duration - to allocate a maximum run-time to the job, e.g.: "`00:05`" (5 mins)
  - Estimated duration - to hint at the actual run-time of the job, e.g.: "`00:01`" (1 min)
    - Setting a low estimate will increase the likelihood of the job being scheduled to run quickly.
  - The Help page on submitting LOTUS jobs is here: https://help.jasmin.ac.uk/article/4890-how-to-submit-a-job-to-slurm
  - Use the "`sbatch`" command to submit each job.
  - If you need some advice you can use the script at:
    `/gws/pw/j07/workshop/exercises/ex05/code/submit-all.sh`
    [ Source: https://github.com/cedadev/jasmin-workshop/blob/master/exercises/ex05/code/submit-all.sh ]
- Run the "`submit-all.sh`" script
- Examine which jobs are in the queue:
  - Type "`squeue`" to review any running jobs.
- Examine the standard output and standard error files.
- If you are happy that the job is doing the right thing, now modify "`submit-all.sh`" so that it will run for all 30 days in September 2018.
- Re-run the "`submit-all.sh`" script.
- Examine which jobs are in the queue.
- Kill one of the jobs whilst it is still running - just to see how it is done:
  - Use the "`scancel`" command: `scancel <job_id>`
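For orientation, here is a minimal sketch of what an "`extract-era-data.sh`"-style wrapper might look like. It is not the reference script from the GWS: the input path pattern, output location and environment-activation step are assumptions, so replace them with the ones you used in exercise 4.

```bash
#!/bin/bash
# Usage: ./extract-era-data.sh YYYYMMDD
# Extract the TCC (total cloud cover) variable from the 4 x 6-hourly files for one day.

DATE=$1                                        # e.g. 20180901
OUTDIR=/gws/pw/j07/workshop/users/$USER/ex05   # hypothetical output location
mkdir -p $OUTDIR

module load jaspy                              # one way to make CDO available (site-specific)

for HOUR in 0000 0600 1200 1800; do
    # Hypothetical input path - use the real ERA-Interim paths from exercise 4
    INPUT=/path/to/era-interim/${DATE}/era-${DATE}-${HOUR}.nc
    OUTPUT=$OUTDIR/tcc-${DATE}-${HOUR}.nc
    cdo selname,TCC $INPUT $OUTPUT             # extract the TCC variable
done
```

And a matching sketch of "`submit-all.sh`", submitting one LOTUS job per day for the first two days. The `sbatch` options shown are illustrative and should follow the directives discussed above (training accounts may need `--account=workshop --partition=workshop` instead):

```bash
#!/bin/bash
# Submit one extraction job per day to LOTUS.

for DAY in 01 02; do                 # extend to $(seq -w 1 30) for the whole month
    DATE=201809${DAY}
    sbatch -p short-serial -t 00:05:00 \
           -o extract.%j.out -e extract.%j.err \
           extract-era-data.sh $DATE
done
```

Using `%j` (the job ID) in the output and error file names keeps each job's logs separate, which is what the directives above ask for.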
- You have learnt about some basic commands to interact with the SLURM scheduler (such as `sbatch` and `squeue`). This manages the submission and execution of jobs via the LOTUS queues. Which other commands might be useful when interacting with the scheduler?

  Table 3 of this help page shows other SLURM commands, such as `scancel` and `scontrol`. You can find out more by typing `man <command>` at the command-line, e.g.: `man scancel`.
- Which queues are available on LOTUS? What is the difference between them? Why would you choose one over another?

  There is a LOTUS queues help page which explains the capabilities of each SLURM queue.
- How can you instruct SLURM to allocate CPUs and memory to specific jobs when you run them? Can you change the allocations when the job is queuing?

  Table 2 of this help page lists common command-line parameters that can be used to instruct SLURM how to allocate CPUs, memory and hosts to certain jobs.
- How can you cancel all your jobs in the SLURM queue?

  The following command will do it:

      scancel -u $USER
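For instance (standard SLURM commands; the job ID is a placeholder), you can also inspect the queues and adjust a job that is still pending:

```bash
sinfo                                               # list partitions/queues and their state
scontrol show job 1234567                           # show the full details of one job
scontrol update JobId=1234567 TimeLimit=00:10:00    # adjust limits while the job is pending
```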