Adding new choice to --on-error #1974

AlexTate · 2024-02-04T00:41:28Z

Summary

This pull request introduces a new choice, kill, for the --on-error parameter.

Motivation

There currently isn't a way to have cwltool immediately stop parallel jobs when one of them fails. One might expect --on-error stop to accomplish this, but the help string is specific and accurate: "do not submit any more steps". Since scatter and subworkflow are treated as single "steps" within the parent workflow, this means cwltool is not wrong to wait for the rest of the step's parallel jobs to finish when --on-error stop. However, sometimes individual scatter jobs take a long time to complete, so if one of them fails early on, cwltool might wait great lengths of time for the other scatter jobs to complete before terminating the workflow. With --on-error kill, all running jobs are quickly notified and self-terminate upon one job's failure.

Demonstration of the Issue

When running the following workflow with cwltool --parallel --on-error stop, the total runtime is ~33 seconds despite one of the scatterstep tasks terminating unexpectedly. Ideally the workflow would terminate immediately. --on-error kill accomplishes that.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

inputs:
  sleeptime:
    type: int[]
    default: [ 33, 33, 33, 33, 33 ]
outputs: { }
requirements:
  - class: ScatterFeatureRequirement

steps:
  scatterstep:
    in: { sleeptime: sleeptime }
    out: [ ]
    scatter: sleeptime
    run:
      class: CommandLineTool
      baseCommand: sleep
      inputs:
        sleeptime: { type: int, inputBinding: { position: 1 } }
      outputs: { }
  kill:
    in: { }
    out: [ ]
    run:
      class: CommandLineTool
      baseCommand: [ 'bash', '-c' ]
      arguments:
        - |
          # Wait 1 second for scatter to spin up then select a random sleep process to kill
          sleep 1
          ps -ef | grep 'sleep 33' | grep -v grep | awk '{print $2}' | shuf | head -n 1 | xargs kill -9
      inputs: { }
      outputs: { }

Forum Post

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868

Concerns

workflow_eval_lock.release() had to be moved to the finally block in MultithreadedJobExecutor.run_jobs()
Are any important steps skipped in JobBase._execute() due to if runtimeContext.kill_switch.is_set(): return? For that matter, shouldn't there be a finally block to contain some of these steps such as deleting runtime-generated files containing secrets? Update Nov 13: these post-subprocess tasks were moved into a finally block in JobBase._execute() to ensure that they aren't skipped by jobs setting or responding to the kill switch. See abc4c3f.
The kill switch response in TaskQueue is fairly loose. Since the response is primarily handled at the job level, any tasks that start after the kill switch is activated will take care of themselves and self terminate Update Nov 13: TaskQueue response to the kill switch was tightened in b302eca. Still, a race condition exists where a job may be started within a narrow window of time after the kill switch has been set, but if that happens the leaked job will still self terminate within the monitor function's polling interval (currently 1 second).

cwl-bot · 2024-02-04T00:49:07Z

This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868/5

cwltool/job.py

codecov · 2024-04-17T08:08:20Z

Codecov Report

Attention: Patch coverage is 67.58621% with 47 lines in your changes missing coverage. Please review.

Project coverage is 80.53%. Comparing base (e4d42b8) to head (e139fae).

Files with missing lines	Patch %	Lines
cwltool/job.py	69.51%	18 Missing and 7 partials ⚠️
cwltool/workflow_job.py	63.63%	4 Missing and 4 partials ⚠️
cwltool/workflow.py	61.53%	5 Missing ⚠️
cwltool/executors.py	63.63%	2 Missing and 2 partials ⚠️
cwltool/task_queue.py	55.55%	3 Missing and 1 partial ⚠️
cwltool/errors.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1974      +/-   ##
==========================================
- Coverage   83.93%   80.53%   -3.41%     
==========================================
  Files          46       46              
  Lines        8312     8396      +84     
  Branches     1959     1973      +14     
==========================================
- Hits         6977     6762     -215     
- Misses        856     1072     +216     
- Partials      479      562      +83

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests
JS Bundle Analysis - Avoid shipping oversized bundles

mr-c

Thank you, again, @AlexTate for your PR!

tests/test_parallel.py::test_on_error_kill is unfortunately failing.

cwltool/argparser.py

mr-c · 2024-10-24T09:59:08Z

Thank you again for this contribution, @AlexTate ! Alas, the test is sometimes hanging in CI: https://github.com/common-workflow-language/cwltool/actions/runs/11496251742/job/31997507449?pr=1974#step:8:1983

mr-c

Almost there!

Check make diff_pydocstyle_report or tox -e py312-pydocstyle

cwltool/task_queue.py

… runtimeContext.on_error = "kill", then the switch is activated. WorkflowKillSwitch is raised so it can be handled at the workflow and executor levels

…ch's status in the monitor function. The monitor function, up to this point, has been for gathering memory usage statistics via a timer thread. A second timer thread now monitors the kill switch.

…revent pending tasks from starting by simply draining the queue. This is a very loose policy, but since kill switch response is handled at the job level, any tasks that start after the kill switch is activated will take care of themselves and self terminate

… an executor. The workflow_eval_lock release had to be moved to the finally block in MultithreadedJobExecutor.run_jobs(). Otherwise, TaskQueue threads running MultithreadedJobExecutor._runner() will never join() because _runner() waits indefinitely for the workflow_eval_lock in its own finally block.

So that the runtime_context object can still be pickled. Other cleanups

…askQueue. This helps to better synchronize the kill switch event and avoid adding/executing tasks after the switch has been set. This approach is tighter than my previous draft, but a race condition still exists where a task might be started after the kill switch has been set and announced. If this happens then the leaked job's monitor function will kill it and the subprocess' lifespan will be a maximum of the monitor's timer interval (currently 1 second). So when this rare event happens, the console output will be potentially confusing since it will show a new job starting after the kill switch has been announced.

… when exiting due to kill switch. Those actions have been placed under a `finally` block so that they are executed by both the "switching" job and the "responding" jobs. However, some of these post actions added a lot of redundant and unhelpful terminal output when handling jobs killed DUE TO the kill switch. The verbose output obscured the error's cause which isn't helpful. Two new process statuses have been added in order to better handle the event: - indeterminant: a default value for processStatus. - killed: the job was killed due to the kill switch being set. This approach also means that partial outputs aren't collected from jobs that have been killed.

1) Once a job has been terminated, all other parallel jobs should also terminate. In this test, the runtime of the workflow indicates whether the kill switch has been handled correctly. If the kill switch is successful then the workflow's runtime should be significantly shorter than sleep_time. 2) Outputs produced by a successful step should still be collected. In this case, the completed step is make_array. To be frank, this test could be simplified by using a ToolTimeLimit requirement rather than process_roulette.cwl

…to this issue. Other changes were offered by the tool, but they are outside the scope of this issue.

…ve MultithreadedJobExecutor ignore allocated resources when deciding whether to run the next parallel job. The steps in this workflow aren't resource intensive, and delaying their execution on this basis will cause the test to fail.

…constraint. The current ResourceRequirement implementation doesn't allow {coresMin: 0}. However, this can still be achieved with a custom RuntimeContext.select_resources()

…matting compliance updates in test_parallel.py

…ble type checking on selectResources() because it's just an implementation detail for the test

…cessful steps when 1) these steps are upstream from a scattered subworkflow, 2) the workflow kill switch is activated by one of the scatter jobs, and 3) on_error==kill

…s() and WorkflowJob.job(). There isn't a need to use getdefault() when querying the value because a default is already set when RuntimeContext is constructed. The checked condition additionally applies to on_error==kill, so the logic can be simplified to on_error!=continue.

…sses. This can be VERY helpful while debugging, particularly when unraveling callback chains.

….receive_output(). Otherwise, MultithreadedJobExecutor.run_jobs() will likely stop iterating over the topmost WorkflowJob.job() before WorkflowJob.do_output_callback() is called to deliver the final workflow outputs.

…ill switch via a ToolTimeLimit requirement. It also uses a much longer timeout which will hopefully be sufficient for the CI server when it is congested.

…Step's repr string, and adding docstrings. Also adding docstring to parallel_steps() because pydocstyle yelled about it. At first it also yelled about object_from_state() and now it doesn't, so... I guess we'll see what the CI run says because I'm not familiar enough with this function to write a docstring for it.

@mr-c

…ent as an argument rather than an entire RuntimeContext, per @mr-c

mr-c reviewed Feb 8, 2024

View reviewed changes

cwltool/job.py Outdated Show resolved Hide resolved

mr-c force-pushed the on-error-abort branch from b92527f to 68c4ebd Compare April 17, 2024 08:01

mr-c force-pushed the on-error-abort branch from 68c4ebd to f249ccb Compare April 20, 2024 18:32

AlexTate marked this pull request as ready for review August 7, 2024 02:20

mr-c requested changes Aug 19, 2024

View reviewed changes

mr-c reviewed Oct 23, 2024

View reviewed changes

cwltool/argparser.py Outdated Show resolved Hide resolved

mr-c requested changes Nov 12, 2024

View reviewed changes

cwltool/task_queue.py Outdated Show resolved Hide resolved

AlexTate and others added 20 commits November 13, 2024 18:02

Adding inputs for new "kill" mode of the on-error parameter

572e865

Adding kill switch trigger. If a job's processStatus != "success" and…

d7b7712

… runtimeContext.on_error = "kill", then the switch is activated. WorkflowKillSwitch is raised so it can be handled at the workflow and executor levels

Actively running jobs respond to the kill switch by checking the swit…

788a89d

…ch's status in the monitor function. The monitor function, up to this point, has been for gathering memory usage statistics via a timer thread. A second timer thread now monitors the kill switch.

Don't create runtime_context.kill_switch by default

0a00582

So that the runtime_context object can still be pickled. Other cleanups

Formatting corrections provided by make cleanup which are relevant …

5924cfe

…to this issue. Other changes were offered by the tool, but they are outside the scope of this issue.

cleanups

1bee55f

DO NOT MERGE, let all tests run

7eb2c0d

Removing irrelevant debug line from test body

03cde29

Using a different approach to remove the "one job per core" resource …

36f12cf

…constraint. The current ResourceRequirement implementation doesn't allow {coresMin: 0}. However, this can still be achieved with a custom RuntimeContext.select_resources()

Adding suggested changes to the --on-error help string per @mr-c. For…

791fe7b

…matting compliance updates in test_parallel.py

Missed running make mypy before pushing last commit. Opting to disa…

93948fb

…ble type checking on selectResources() because it's just an implementation detail for the test

don't cancel jobs; typecheck the test just in case

d938bf8

test_on_error_kill: allow for some longer runtime due to busy systems

5d92d2d

Fixing ReceiveScatterOutput to allow outputs to be collected from suc…

fe53fb8

…cessful steps when 1) these steps are upstream from a scattered subworkflow, 2) the workflow kill switch is activated by one of the scatter jobs, and 3) on_error==kill

AlexTate added 8 commits November 13, 2024 18:06

Adding __repr__ to WorkflowStep, WorkflowJob, and WorkflowJobStep cla…

8e0b6f2

…sses. This can be VERY helpful while debugging, particularly when unraveling callback chains.

Simplifying the approach in on-error_kill.cwl. It now activates the k…

6485cff

…ill switch via a ToolTimeLimit requirement. It also uses a much longer timeout which will hopefully be sufficient for the CI server when it is congested.

Reformatting and type hint updates

beb6af7

Formatting mistake fix

50f2793

Changing TaskQueue constructor to accept the kill_switch threading.Ev…

e8f9284

…ent as an argument rather than an entire RuntimeContext, per @mr-c

mr-c force-pushed the on-error-abort branch from 23fbe19 to e8f9284 Compare November 13, 2024 17:07

mr-c requested a review from tetron November 13, 2024 17:09

mr-c and others added 2 commits November 13, 2024 18:56

codecov: don't fail CI if there is an error

f17f678

Merge branch 'main' into on-error-abort

e139fae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding new choice to --on-error #1974

Adding new choice to --on-error #1974

AlexTate commented Feb 4, 2024 •

edited

Loading

cwl-bot commented Feb 4, 2024

codecov bot commented Apr 17, 2024 •

edited

Loading

mr-c left a comment

mr-c commented Oct 24, 2024

mr-c left a comment

Adding new choice to --on-error #1974

Are you sure you want to change the base?

Adding new choice to --on-error #1974

Conversation

AlexTate commented Feb 4, 2024 • edited Loading

Summary

Motivation

Demonstration of the Issue

Forum Post

Concerns

cwl-bot commented Feb 4, 2024

codecov bot commented Apr 17, 2024 • edited Loading

Codecov Report

mr-c left a comment

Choose a reason for hiding this comment

mr-c commented Oct 24, 2024

mr-c left a comment

Choose a reason for hiding this comment

AlexTate commented Feb 4, 2024 •

edited

Loading

codecov bot commented Apr 17, 2024 •

edited

Loading