Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

The status of the AWS batch job should become failed once the retry limit exceeded #291

Merged
merged 15 commits into from
Dec 1, 2022

Conversation

pingsutw
Copy link
Member

@pingsutw pingsutw commented Oct 17, 2022

Signed-off-by: Kevin Su [email protected]

TL;DR

The task status is always running if the task failure is retryable. In other words, the propeller will keep waiting no matter how often the task is re-run. Take a look at these code. If all the sub-task failed and all the failure is retryable, the state of the task will never become PhaseWriteToDiscoveryThenFail. (totalRetryableFailures is equal to minSuccesses here)

In this pr, we kept tracking how often the task has retried and counted the total number of sub-tasks that exceeded the retry limit. Stop the task when totalRunning+totalSuccesses+totalWaitingForResources+totalRetryableFailures-totalRetryLimitExceeded < minSuccesses is true

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

image

Tracking Issue

flyteorg/flyte#2979

Follow-up issue

NA

@codecov
Copy link

codecov bot commented Oct 17, 2022

Codecov Report

Merging #291 (6809ea4) into master (eed7326) will decrease coverage by 1.04%.
The diff coverage is 70.00%.

@@            Coverage Diff             @@
##           master     #291      +/-   ##
==========================================
- Coverage   63.32%   62.27%   -1.05%     
==========================================
  Files         145      145              
  Lines        9311    11511    +2200     
==========================================
+ Hits         5896     7169    +1273     
- Misses       2872     3797     +925     
- Partials      543      545       +2     
Flag Coverage Δ
unittests 62.27% <70.00%> (-0.47%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
go/tasks/plugins/array/awsbatch/executor.go 36.95% <0.00%> (-1.10%) ⬇️
go/tasks/plugins/array/awsbatch/monitor.go 67.42% <73.68%> (+5.62%) ⬆️
go/tasks/plugins/k8s/pod/container.go 57.14% <0.00%> (-9.53%) ⬇️
.../plugins/k8s/kfoperators/common/common_operator.go 71.30% <0.00%> (-8.08%) ⬇️
...o/tasks/pluginmachinery/utils/secrets/marshaler.go 64.28% <0.00%> (-7.94%) ⬇️
go/tasks/plugins/webapi/athena/utils.go 67.92% <0.00%> (-7.35%) ⬇️
...o/tasks/pluginmachinery/ioutils/raw_output_path.go 68.42% <0.00%> (-6.58%) ⬇️
go/tasks/plugins/hive/client/qubole_client.go 59.44% <0.00%> (-6.38%) ⬇️
...sks/plugins/k8s/sagemaker/hyperparameter_tuning.go 55.20% <0.00%> (-5.25%) ⬇️
go/tasks/pluginmachinery/core/plugin.go 70.00% <0.00%> (-5.00%) ⬇️
... and 122 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@pingsutw pingsutw changed the title Turn PhaseRetryableFailure into PhaseRetryLimitExceededFailure once the retry limit exceeded The status of the AWS batch job should become failed once the retry limit exceeded Oct 17, 2022
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@pingsutw pingsutw marked this pull request as draft October 18, 2022 18:31
Signed-off-by: Kevin Su <[email protected]>
@pingsutw pingsutw marked this pull request as ready for review October 21, 2022 20:37
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@hamersaw hamersaw merged commit d5295d2 into master Dec 1, 2022
@hamersaw hamersaw deleted the batch-bug-1 branch December 1, 2022 09:34
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
…imit exceeded (#291)

* Turn PhaseRetryableFailure into PhaseRetryLimitExceededFailure

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* update

Signed-off-by: Kevin Su <[email protected]>

* update test

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

* update

Signed-off-by: Kevin Su <[email protected]>

* update tests

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* udpate

Signed-off-by: Kevin Su <[email protected]>

* address comment

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* fix tests

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

Signed-off-by: Kevin Su <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants