Ensure split jobs stop if a preceding job fails #27

tsherwen · 2022-06-06T09:18:58Z

Currently, split jobs are submitted with a dependency so that each job starts only if the preceding job completes successfully. However, when a job exits with a model crash, the remaining queued jobs that were submitted with it still run.

The submission script is written by [create_SLURM_run_script2submit_together](https://github.com/wacl-york/geos-chem-schedule/blob/main/core.py#L1033-L1065), which uses the SLURM option --dependency=afterok.

TODO: work out how to capture all the job/model fail codes via SLURM and then abort the following model runs in the queue.
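One possible approach to the TODO is sketched below: query SLURM accounting for the finished job with sacct, and cancel the jobs queued behind it with scancel if any step failed. This is an untested sketch rather than code from this repository. The function names are hypothetical, and whether per-step exit codes are recorded depends on how accounting is configured on the cluster.

```python
import subprocess

# States that SLURM reports for jobs that did not finish cleanly.
FAILED_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def parse_sacct(output):
    """Parse `sacct --parsable2 --noheader --format=JobID,State,ExitCode` output.

    Returns a list of (job_id, state, exit_code) tuples, one per job step.
    """
    records = []
    for line in output.strip().splitlines():
        job_id, state, exit_code = line.split("|")
        # State can carry a suffix, e.g. "CANCELLED by 1234".
        state = state.split()[0]
        # ExitCode is reported as "<return code>:<signal>".
        code = int(exit_code.split(":")[0])
        records.append((job_id, state, code))
    return records


def job_failed(records):
    """Return True if any step failed or returned a non-zero exit code."""
    return any(state in FAILED_STATES or code != 0 for _, state, code in records)


def cancel_if_failed(job_id, following_job_ids):
    """Cancel the jobs queued after job_id if job_id (or one of its steps) failed."""
    output = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
         "--format=JobID,State,ExitCode"],
        capture_output=True, text=True, check=True,
    ).stdout
    if job_failed(parse_sacct(output)):
        subprocess.run(["scancel", *map(str, following_job_ids)], check=True)
        return True
    return False
```

Checking every step, not only the top-level job record, matters here because a batch job can be recorded as COMPLETED even when an srun step inside it returned a non-zero code.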

import os
import stat


def create_SLURM_run_script2submit_together(times):
    """
    Create the script that submits all split jobs as a dependency chain
    Parameters
    -------
    times (list of str): job times, each in the format YYYYMMDD
    Returns
    -------
    (None)
    """
    print(times)
    FileName = 'run_geos_SLURM_queue_all_jobs.sh'
    Line0 = "#!/bin/bash \n"
    Line1 = """job_num_{time}=$(sbatch --parsable SLURM_queue_files/{time}.sbatch) \n"""
    Line2 = """echo "$job_num_{time}" \n"""
    Line3 = """job_num_{time2}=$(sbatch --parsable --dependency=afterok:"$job_num_{time1}" SLURM_queue_files/{time2}.sbatch) \n"""
    with open(FileName, 'w') as run_script:
        # The slice means the last entry of times is not submitted as a job
        for n_time, time in enumerate(times[:-1]):
            if time == times[0]:
                # First job: write the shebang and submit without a dependency
                run_script.write(Line0)
                run_script.write(Line1.format(time=time))
                run_script.write(Line2.format(time=time))
            else:
                # Later jobs: start only if the previous job exited with code 0
                run_script.write(Line3.format(time1=times[n_time-1], time2=time))
                run_script.write(Line2.format(time=time))
    # Change the permissions so it is executable
    st = os.stat(FileName)
    os.chmod(FileName, st.st_mode | stat.S_IEXEC)
    return

An example of the emailed job status is shown in (1), and the output from an aborted model run in (2).

(1) Slurm Job_id=17908287 Name=Iso.UnlimAll.2 Ended, Run time 1-09:03:48, COMPLETED, ExitCode 0
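SLURM takes a batch job's exit code from the batch script itself, so a job can be recorded as COMPLETED with ExitCode 0 if the script carries on after the model command fails and its last command succeeds. In that case afterok is satisfied and the next job starts. A minimal sketch of one way to avoid this, assuming the generated .sbatch file launches the model with a single command line: emit a check that exits with the model's status. The helper name and the example command are hypothetical and not taken from this repository.

```python
def wrap_with_exit_check(run_command):
    """Return bash lines that run the model and exit with its status on failure.

    `run_command` is a placeholder for whatever line launches the model
    in the generated .sbatch file (an assumption, not the repository's code).
    """
    return (
        f"{run_command}\n"
        "status=$?\n"
        'if [ "$status" -ne 0 ]; then\n'
        '    echo "Model exited with code $status; stopping the chain" >&2\n'
        '    exit "$status"\n'
        "fi\n"
    )


if __name__ == "__main__":
    # Hypothetical launch command, for illustration only.
    print(wrap_with_exit_check("srun ./model_executable"))
```

With a non-zero exit from the batch script, the job is no longer recorded as successful, so the afterok dependency of the next job is not met and that job does not start.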

(2)

---> DATE: 2018/06/05  UTC: 09:30  X-HRS:   3729.500000
===============================================================================
WETDEP: ERROR at   42  23  71 for species  128 in area RESUSPENSION in middle levels
 LS          :  T
 PDOWN       :   0.000000000000000E+000
 QQ          :   0.000000000000000E+000
 ALPHA       :   0.000000000000000E+000
 ALPHA2      :   0.000000000000000E+000
 RAINFRAC    :   0.000000000000000E+000
 WASHFRAC    :   0.000000000000000E+000
 MASS_WASH   :   0.000000000000000E+000
 MASS_NOWASH :   0.000000000000000E+000
 WETLOSS     :   0.000000000000000E+000
 GAINED      :   0.000000000000000E+000
 LOST        :   0.000000000000000E+000
 DSpc(NW,:)  :   0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 Spc(I,J,:N) :   0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000 -1.198463418495579E-013
 -3.378856226737475E-014 -2.314320910996734E-015 -3.019816125666191E-017
 -2.520870547180549E-019 -2.822882195877689E-020 -1.445442763521664E-020
 -6.900126364332953E-022 -8.767709715535021E-024 -4.400381547465612E-025
 -1.720033035554861E-026 -4.865030418189842E-028 -1.209928241506788E-029
 -2.093866091759672E-031 -5.970450354644719E-033 -5.162779587092464E-035
 -3.748736158226715E-037 -6.734895733485428E-040 -1.499747125810236E-039
 -1.842168830893617E-038 -1.148225302626263E-037 -8.781446960823125E-037
 -5.560642728382409E-034 -3.794567594400997E-031 -1.821084089043769E-029
 -4.488053629617312E-028 -1.680343276761365E-026 -1.155873919999522E-024
 -3.060986958260714E-022 -1.313308483959904E-021 -1.574449846171001E-021
 -1.338100164273038E-021 -6.956426779262325E-022 -3.506180178581743E-022
 -6.266069886329827E-022 -1.353555184532673E-021 -4.106777266211684E-021
 -1.014640714706731E-020 -1.134631269064776E-020 -9.833174509975247E-021
 -7.707150917708346E-021
===============================================================================
===============================================================================
GEOS-Chem ERROR: Error encountered in wet deposition!
 -> at SAFETY (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR: Error encountered in "Safety"!
 -> at Do_Complete_Reevap (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR:
 -> at WetDep (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR: Error encountered in "Wetdep"!
 -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-CHEM ERROR: Error encountered in "Do_WetDep"!
STOP at  -> at GEOS-Chem (in GeosCore/main.F90)
===============================================================================
srun: error: node112: task 0: Exited with exit code 159
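If the exit code cannot be relied on, a fallback is to scan the model log for the failure markers visible in the output above. The sketch below is illustrative only: the function name is hypothetical, and the two patterns are taken from this single log excerpt, so other failure modes may print different text.

```python
import re

# Markers taken from the log excerpt above; matched case-insensitively
# because the log prints both "GEOS-Chem ERROR" and "GEOS-CHEM ERROR".
ERROR_PATTERN = re.compile(r"GEOS-Chem ERROR", re.IGNORECASE)
SRUN_EXIT_PATTERN = re.compile(r"srun: error: .*Exited with exit code (\d+)")


def log_shows_failure(log_text):
    """Return True if the log contains a model error banner or a
    non-zero exit code reported by srun."""
    if ERROR_PATTERN.search(log_text):
        return True
    match = SRUN_EXIT_PATTERN.search(log_text)
    return bool(match and int(match.group(1)) != 0)
```

A check like this could run at the end of each batch script, or before the next job starts, and force a non-zero exit when it returns True.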

tsherwen added the bug label Jun 6, 2022

tsherwen assigned tsherwen and matt-rowlinson Jun 6, 2022
