Currently, split jobs can be submitted with a dependency so that each job proceeds only if the preceding job completed successfully. However, when a job exits with a model crash, the rest of the queued jobs submitted together still proceed.
This is handled by [create_SLURM_run_script2submit_together](https://github.com/wacl-york/geos-chem-schedule/blob/main/core.py#L1033-L1065), which uses the SLURM option --dependency=afterok.
TODO: work out how to capture all the job/model failure codes via SLURM and then abort the following model runs in the queue (one possible direction is sketched after the example notices below).
import os
import stat  # os and stat are needed for the chmod call at the end


def create_SLURM_run_script2submit_together(times):
    """
    Create the script that can set the 1st scheduled job running

    Parameters
    -------
    times (list): list of date strings (YYYYMMDD) to submit job scripts for

    Returns
    -------
    (None)
    """
    print(times)
    FileName = 'run_geos_SLURM_queue_all_jobs.sh'
    run_script = open(FileName, 'w')
    # Template lines for the generated bash script
    Line0 = "#!/bin/bash \n"
    Line1 = """job_num_{time}=$(sbatch --parsable SLURM_queue_files/{time}.sbatch) \n"""
    Line2 = """echo "$job_num_{time}" \n"""
    Line3 = """job_num_{time2}=$(sbatch --parsable --dependency=afterok:"$job_num_{time1}" SLURM_queue_files/{time2}.sbatch) \n"""
    # Loop over the dates, excluding the final entry
    for n_time, time in enumerate(times[:-1]):
        if time == times[0]:
            # First job: submitted with no dependency
            run_script.write(Line0)
            run_script.write(Line1.format(time=time))
            run_script.write(Line2.format(time=time))
        else:
            # Subsequent jobs: depend on the previous job completing OK
            run_script.write(Line3.format(time1=times[n_time-1], time2=time))
            run_script.write(Line2.format(time=time))
    run_script.close()
    # Change the permissions so it is executable
    st = os.stat(FileName)
    os.chmod(FileName, st.st_mode | stat.S_IEXEC)
    return
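For reference, a minimal sketch of the script this function writes for a run split into two jobs (the dates 20130101 and 20130201 are placeholders) would look like this:

    #!/bin/bash
    job_num_20130101=$(sbatch --parsable SLURM_queue_files/20130101.sbatch)
    echo "$job_num_20130101"
    job_num_20130201=$(sbatch --parsable --dependency=afterok:"$job_num_20130101" SLURM_queue_files/20130201.sbatch)
    echo "$job_num_20130201"

Each job after the first is submitted with --dependency=afterok:<previous job ID>, so SLURM should only start it once the previous job has finished with exit code 0.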
An example of the emailed job notification (with its exit code) is given in (1), and the corresponding model run abort output in (2).
(1) Slurm Job_id=17908287 Name=Iso.UnlimAll.2 Ended, Run time 1-09:03:48, COMPLETED, ExitCode 0
(2)
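As (1) shows, the SLURM job can report COMPLETED with ExitCode 0 even though the model itself aborted, which would explain why the afterok dependency never trips and the rest of the chain keeps running. One possible direction, sketched here only as an untested assumption (the executable name and log path are placeholders, not from this repository), is to have each generated .sbatch script propagate the model's exit status to SLURM:

    # Hypothetical excerpt from a SLURM_queue_files/<date>.sbatch script
    srun ./gcclassic > geos.log 2>&1   # run the model; executable name is a placeholder
    model_status=$?
    if [ "$model_status" -ne 0 ]; then
        echo "Model exited with status $model_status; failing this job so afterok dependants do not start"
        exit "$model_status"
    fi

If a non-zero exit status reaches SLURM, the dependent jobs should be held with reason DependencyNeverSatisfied; adding --kill-on-invalid-dep=yes to the chained sbatch calls would cancel them outright rather than leaving them pending. Job and step exit codes can also be inspected afterwards with, for example, sacct -j 17908287 --format=JobID,JobName,State,ExitCode.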