-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Jobs] Add option to specify max_restarts_on_errors
#4169
Merged
Merged
Changes from 33 commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
8eba87b
Add option to specify `max_retry_on_failure`
Michaelvll 7294204
fix recover counts
Michaelvll 3ab9619
fix log streaming
Michaelvll 7145842
fix docs
Michaelvll 8bfd59a
fix
Michaelvll e459271
fix
Michaelvll de78310
fix
Michaelvll 23345c0
fix
Michaelvll 3709cd6
fix default value
Michaelvll 92e7c35
Fix spinner
Michaelvll 935491e
Add unit test for default strategy
Michaelvll 90f95b1
fix test
Michaelvll ceff8cd
format
Michaelvll a20fa5c
Update docs/source/examples/managed-jobs.rst
Michaelvll b5b35f4
rename to restarts
Michaelvll 149c9fd
Merge branch 'jobs-max-retry-on-failure' of github.com:skypilot-org/s…
Michaelvll 1947605
Merge branch 'master' of github.com:skypilot-org/skypilot into jobs-m…
Michaelvll 7cf2b17
Update docs/source/examples/managed-jobs.rst
Michaelvll 44882fe
update docs
Michaelvll 599a838
Merge branch 'master' of github.com:skypilot-org/skypilot into jobs-m…
Michaelvll a7d266b
warning instead of error out
Michaelvll 087414b
Update docs/source/examples/managed-jobs.rst
Michaelvll 3ffadb1
rename
Michaelvll da26fc1
add comment
Michaelvll bea7fe0
Merge branch 'jobs-max-retry-on-failure' of github.com:skypilot-org/s…
Michaelvll 987df3d
fix
Michaelvll c3e88a2
rename
Michaelvll 92acfe3
Update sky/execution.py
Michaelvll 71e9518
Update sky/execution.py
Michaelvll acd96ab
address comments
Michaelvll 02b0b19
Merge branch 'jobs-max-retry-on-failure' of github.com:skypilot-org/s…
Michaelvll 86d0d64
format
Michaelvll 0573c33
commit changes for docs
Michaelvll df62ee4
Format
Michaelvll d79cd2f
Merge branch 'master' of github.com:skypilot-org/skypilot into jobs-m…
Michaelvll File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -70,7 +70,7 @@ | |
# state, after the job finished. This is a safeguard to avoid the case where | ||
# the managed job status fails to be updated and keep the `sky jobs logs` | ||
# blocking for a long time. | ||
_FINAL_JOB_STATUS_WAIT_TIMEOUT_SECONDS = 20 | ||
_FINAL_JOB_STATUS_WAIT_TIMEOUT_SECONDS = 25 | ||
|
||
|
||
class UserSignal(enum.Enum): | ||
|
@@ -392,8 +392,12 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: | |
f'INFO: Log for the current task ({task_id}) ' | ||
'is finished. Waiting for the next task\'s log ' | ||
'to be started.') | ||
status_display.update('Waiting for the next task: ' | ||
f'{task_id + 1}.') | ||
# Add a newline to avoid the status display below | ||
# removing the last line of the task output. | ||
print() | ||
romilbhardwaj marked this conversation as resolved.
Show resolved
Hide resolved
|
||
status_display.update( | ||
ux_utils.spinner_message( | ||
f'Waiting for the next task: {task_id + 1}')) | ||
status_display.start() | ||
original_task_id = task_id | ||
while True: | ||
|
@@ -405,7 +409,27 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: | |
time.sleep(JOB_STATUS_CHECK_GAP_SECONDS) | ||
continue | ||
else: | ||
break | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems the retry logic is added in the wrong branch. It's currently in the |
||
task_specs = managed_job_state.get_task_specs( | ||
job_id, task_id) | ||
if task_specs.get('max_restarts_on_errors', 0) == 0: | ||
# We don't need to wait for the managed job status | ||
# update, as the job is guaranteed to be in terminal | ||
# state afterwards. | ||
break | ||
print() | ||
status_display.update( | ||
ux_utils.spinner_message( | ||
'Waiting for next restart for the failed task')) | ||
status_display.start() | ||
while True: | ||
_, managed_job_status = ( | ||
managed_job_state.get_latest_task_id_status( | ||
job_id)) | ||
if (managed_job_status != | ||
managed_job_state.ManagedJobStatus.RUNNING): | ||
break | ||
time.sleep(JOB_STATUS_CHECK_GAP_SECONDS) | ||
continue | ||
# The job can be cancelled by the user or the controller (when | ||
# the cluster is partially preempted). | ||
logger.debug( | ||
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of our users mentioned backoff between restarts - any thoughts on adding it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have backoff between launches if the resources are not available across all regions/clouds. I feel adding additional backoff between job restarts is not that clean.