Ensure split jobs stop if a preceding job fails #27

Open
tsherwen opened this issue Jun 6, 2022 · 0 comments

tsherwen commented Jun 6, 2022

Currently, split jobs are submitted with a dependency so that each job starts only if the preceding job completed successfully. However, when a job exits after a model crash, the remaining queued jobs that were submitted with it still proceed.

The dependency is set in [create_SLURM_run_script2submit_together](https://github.com/wacl-york/geos-chem-schedule/blob/main/core.py#L1033-L1065), which uses the SLURM option --dependency=afterok.

TODO: work out how to capture all job and model failure codes via SLURM, then abort the subsequent model runs in the queue.
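One possible starting point for capturing the fail codes is SLURM's accounting command, `sacct`, which lists the state and exit code of a job and of each of its steps. The sketch below assumes job accounting is enabled on the cluster and that the job has already finished. The function names `parse_sacct`, `job_failed` and `query_job` are hypothetical and not part of geos-chem-schedule.

```python
import subprocess


def parse_sacct(output):
    """Parse `sacct --parsable2 --noheader` text into (job_id, state, return_code) tuples."""
    records = []
    for line in output.strip().splitlines():
        job_id, state, exit_code = line.split("|")[:3]
        # sacct reports ExitCode as "<return code>:<signal>"
        return_code = int(exit_code.split(":")[0])
        records.append((job_id, state, return_code))
    return records


def job_failed(records):
    """True if a finished job, or any of its steps, did not complete with exit code 0."""
    return any(state != "COMPLETED" or code != 0 for _, state, code in records)


def query_job(job_id):
    """Ask SLURM accounting for the state and exit code of a job and its steps."""
    cmd = ["sacct", "-j", str(job_id), "--format=JobID,State,ExitCode",
           "--parsable2", "--noheader"]
    return parse_sacct(subprocess.check_output(cmd, text=True))
```

Checking every step rather than only the top-level job may matter here, because a batch script can finish with exit code 0 even when the model step inside it did not.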

import os
import stat


def create_SLURM_run_script2submit_together(times):
    """
    Create the script that submits all the split jobs as a dependency chain

    Parameters
    -------
    times (list): string times to run job scripts for, each in the format YYYYMMDD

    Returns
    -------
    (None)
    """
    print(times)
    FileName = 'run_geos_SLURM_queue_all_jobs.sh'
    Line0 = "#!/bin/bash \n"
    # Submit the first job and store its job ID
    Line1 = """job_num_{time}=$(sbatch --parsable SLURM_queue_files/{time}.sbatch) \n"""
    Line2 = """echo "$job_num_{time}" \n"""
    # Submit a later job that only starts if the previous one exits with code 0
    Line3 = """job_num_{time2}=$(sbatch --parsable --dependency=afterok:"$job_num_{time1}" SLURM_queue_files/{time2}.sbatch) \n"""
    with open(FileName, 'w') as run_script:
        # NOTE: the final entry of times is not submitted
        for n_time, time in enumerate(times[:-1]):
            if time == times[0]:
                run_script.write(Line0)
                run_script.write(Line1.format(time=time))
                run_script.write(Line2.format(time=time))
            else:
                run_script.write(Line3.format(time1=times[n_time-1], time2=time))
                run_script.write(Line2.format(time=time))
    # Change the permissions so it is executable
    st = os.stat(FileName)
    os.chmod(FileName, st.st_mode | stat.S_IEXEC)
    return
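With afterok alone, a job whose dependency fails is normally left pending rather than removed from the queue. sbatch also accepts `--kill-on-invalid-dep=yes`, which terminates a job whose dependency can never be satisfied. The sketch below is one way the chain could be written with that option added. `build_queue_script` and its `queue_dir` argument are hypothetical names, and unlike the function above it returns the script text and submits every entry it is given. It only helps if the failed job actually reports a non-zero exit code to SLURM.

```python
def build_queue_script(times, queue_dir="SLURM_queue_files"):
    """Return the text of a bash script that submits one job per entry in times as a chain."""
    lines = ["#!/bin/bash"]
    previous = None
    for time in times:
        options = "--parsable"
        if previous is not None:
            # Start only if the previous job exited with code 0, and cancel
            # this job if that can never happen
            options += (' --dependency=afterok:"$job_num_{prev}"'
                        " --kill-on-invalid-dep=yes").format(prev=previous)
        lines.append("job_num_{t}=$(sbatch {o} {d}/{t}.sbatch)".format(
            t=time, o=options, d=queue_dir))
        lines.append('echo "$job_num_{t}"'.format(t=time))
        previous = time
    return "\n".join(lines) + "\n"
```

Returning the text instead of writing the file directly keeps the chaining logic easy to check before anything is submitted.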

An example of the emailed exit status is shown in (1), and the model output from an aborted run in (2).

(1) Slurm Job_id=17908287 Name=Iso.UnlimAll.2 Ended, Run time 1-09:03:48, COMPLETED, ExitCode 0

(2)

---> DATE: 2018/06/05  UTC: 09:30  X-HRS:   3729.500000
===============================================================================
WETDEP: ERROR at   42  23  71 for species  128 in area RESUSPENSION in middle levels
 LS          :  T
 PDOWN       :   0.000000000000000E+000
 QQ          :   0.000000000000000E+000
 ALPHA       :   0.000000000000000E+000
 ALPHA2      :   0.000000000000000E+000
 RAINFRAC    :   0.000000000000000E+000
 WASHFRAC    :   0.000000000000000E+000
 MASS_WASH   :   0.000000000000000E+000
 MASS_NOWASH :   0.000000000000000E+000
 WETLOSS     :   0.000000000000000E+000
 GAINED      :   0.000000000000000E+000
 LOST        :   0.000000000000000E+000
 DSpc(NW,:)  :   0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 Spc(I,J,:N) :   0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000 -1.198463418495579E-013
 -3.378856226737475E-014 -2.314320910996734E-015 -3.019816125666191E-017
 -2.520870547180549E-019 -2.822882195877689E-020 -1.445442763521664E-020
 -6.900126364332953E-022 -8.767709715535021E-024 -4.400381547465612E-025
 -1.720033035554861E-026 -4.865030418189842E-028 -1.209928241506788E-029
 -2.093866091759672E-031 -5.970450354644719E-033 -5.162779587092464E-035
 -3.748736158226715E-037 -6.734895733485428E-040 -1.499747125810236E-039
 -1.842168830893617E-038 -1.148225302626263E-037 -8.781446960823125E-037
 -5.560642728382409E-034 -3.794567594400997E-031 -1.821084089043769E-029
 -4.488053629617312E-028 -1.680343276761365E-026 -1.155873919999522E-024
 -3.060986958260714E-022 -1.313308483959904E-021 -1.574449846171001E-021
 -1.338100164273038E-021 -6.956426779262325E-022 -3.506180178581743E-022
 -6.266069886329827E-022 -1.353555184532673E-021 -4.106777266211684E-021
 -1.014640714706731E-020 -1.134631269064776E-020 -9.833174509975247E-021
 -7.707150917708346E-021
===============================================================================
===============================================================================
GEOS-Chem ERROR: Error encountered in wet deposition!
 -> at SAFETY (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR: Error encountered in "Safety"!
 -> at Do_Complete_Reevap (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR:
 -> at WetDep (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-Chem ERROR: Error encountered in "Wetdep"!
 -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)
===============================================================================

===============================================================================
GEOS-CHEM ERROR: Error encountered in "Do_WetDep"!
STOP at  -> at GEOS-Chem (in GeosCore/main.F90)
===============================================================================
srun: error: node112: task 0: Exited with exit code 159
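If SLURM itself reports the job as successful, the failure can still be detected from the log text, which contains both the GEOS-Chem error banners and the srun exit code. The sketch below is a minimal log check built on the two patterns visible in the output above. `model_exit_code` is a hypothetical helper, and the patterns are assumptions based on this one log rather than a complete list of GEOS-Chem failure messages.

```python
import re

# Matches both "GEOS-Chem ERROR" and "GEOS-CHEM ERROR" as seen in the log
ERROR_MARKER = re.compile(r"GEOS-Chem ERROR", re.IGNORECASE)
# Matches the srun summary line, e.g. "Exited with exit code 159"
SRUN_EXIT = re.compile(r"Exited with exit code (\d+)")


def model_exit_code(log_text):
    """Return a non-zero code if the log shows a model failure, else 0."""
    match = SRUN_EXIT.search(log_text)
    if match:
        return int(match.group(1))
    if ERROR_MARKER.search(log_text):
        return 1  # generic failure when no explicit exit code is logged
    return 0
```

A job script could run this on the model log as its last step and exit with the returned value, so that the batch job ends with a non-zero code and the afterok dependency of the next job is not satisfied.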
