[Jobs] Add option to specify `max_restarts_on_errors` #4169
One of our users mentioned backoff between restarts - any thoughts on adding it here?
We do have backoff between launches if the resources are not available across all regions/clouds. I feel adding additional backoff between job restarts is not that clean.
The name of this method is a bit misleading, since it does not "trigger" a retry; it just records one. Maybe rename it to `log_retry_on_failure` or `record_retry_on_failure`?
The docstring could also be updated to something like:
"""Records a retry event after a job failure and returns if more retries should be attempted."""
Good point! I renamed it to `should_restart_on_failure` and updated the docstring. Wdyt?
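For readers following the naming discussion, here is a minimal sketch of what a method with this "record, then decide" behavior might look like. The `RestartTracker` class and its `restart_count` attribute are hypothetical names made up for illustration; only `should_restart_on_failure` and `max_restarts_on_errors` come from this thread, and the PR's actual implementation may differ.

```python
class RestartTracker:
    """Hypothetical bookkeeping class; not the class used in the PR."""

    def __init__(self, max_restarts_on_errors: int = 0) -> None:
        # Upper bound on restarts after failures (0 means never restart).
        self.max_restarts_on_errors = max_restarts_on_errors
        # Number of failures recorded so far (assumed attribute name).
        self.restart_count = 0

    def should_restart_on_failure(self) -> bool:
        """Records a failure and returns whether the job should restart."""
        self.restart_count += 1
        return self.restart_count <= self.max_restarts_on_errors


tracker = RestartTracker(max_restarts_on_errors=2)
# The third recorded failure exceeds the limit of two restarts.
decisions = [tracker.should_restart_on_failure() for _ in range(3)]
```

The method both mutates state and returns a decision, which is why a name starting with `should_` plus a docstring mentioning the recording side effect reads more accurately than a name suggesting it triggers the retry itself.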
Seems the retry logic is added in the wrong branch. It's currently in the `else` branch of `if task_id < num_tasks - 1 and follow`, which means it only triggers when we want to terminate. The retry check should be in the outer `else` branch where we handle cluster failures.