Handle DWS fatal errors #44

Merged
roehrich-hpe merged 2 commits into DataWorkflowServices:main from dws-fatal-errors
Sep 22, 2023
roehrich-hpe
Contributor

Set `Flags=TeardownFailure` in burst_buffer.conf. This tells Slurm to go to Teardown if there are errors during stage_in or stage_out.

In the burst_buffer.lua plugin, adjust `slurm_bb_job_teardown()` to check the workflow for any fatal errors recorded in the drivers array. If any are found, we know teardown was called because of a fatal error. In that case we delete the workflow, as usual, and also call the Slurm `scancel` command to cancel the Slurm job.
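
The teardown decision described above can be sketched as follows. This is a minimal illustration in Python, not the plugin's Lua code. The workflow layout (`status.drivers`, with `status` and `message` keys per entry), the `"Error"` marker for a fatal error, and the `delete_workflow` step name are all assumptions made for the example. The only real command named is `scancel`, and the sketch builds its argument list without running it.

```python
from typing import Any, Dict, List

# Assumption: a driver entry whose "status" equals this value is a fatal error.
FATAL_STATUS = "Error"


def find_fatal_errors(workflow: Dict[str, Any]) -> List[str]:
    """Collect messages from driver entries that report a fatal error."""
    # Assumed layout: workflow["status"]["drivers"] is a list of dicts.
    drivers = (workflow.get("status") or {}).get("drivers") or []
    return [
        d.get("message", "")
        for d in drivers
        if d.get("status") == FATAL_STATUS
    ]


def plan_teardown(workflow: Dict[str, Any], job_id: int) -> List[List[str]]:
    """Return the teardown steps as argument lists, without running them."""
    # The workflow is always deleted; "delete_workflow" is a placeholder name.
    steps = [["delete_workflow"]]
    # Only a fatal error adds the job cancellation step.
    if find_fatal_errors(workflow):
        steps.append(["scancel", str(job_id)])
    return steps
```

With a workflow whose drivers array holds one entry marked `"Error"`, `plan_teardown` returns both the delete step and the `scancel` step. With no such entry it returns only the delete step. The order of the two steps is a choice made for the sketch.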


Signed-off-by: Dean Roehrich <[email protected]>

github-actions bot commented Sep 22, 2023

Code Coverage

| Package | Line Rate | Health |
| --- | --- | --- |
| burst_buffer | 87% | |
| Summary | 87% (337 / 387) | |

@roehrich-hpe roehrich-hpe merged commit bb4fe6c into DataWorkflowServices:main Sep 22, 2023
4 checks passed
@roehrich-hpe roehrich-hpe deleted the dws-fatal-errors branch September 22, 2023 18:50