RayJob: update Finished() to account for JobDeploymentStatus #3120
Signed-off-by: Andrew Sy Kim <[email protected]>
/retest
That makes sense, thanks. /lgtm
LGTM label has been added. Git tree hash: 1914b02f5adbf515a0575840544daa6bfd49ab6f
/approve
LGTM (not tagging yet just to wait for addressing the nits)
t.Logf("actual success: %v", success)
t.Logf("expected success: %v", testcase.expectedSuccess)
t.Error("unexpected result for 'success'")
Suggested change:
t.Errorf("unexpected result for 'success'; want=%v, got=%v", testcase.expectedSuccess, success)
nit
t.Logf("actual finished: %v", finished)
t.Logf("expected finished: %v", testcase.expectedFinished)
t.Error("unexpected result for 'finished'")
Suggested change:
t.Errorf("unexpected result for 'finished'; want=%v, got=%v", testcase.expectedFinished, finished)
nit
	expectedFinished: true,
},
{
	name: "jobStatus=Running, jobDeploymentStatus=Failed (when activeDeadlineSeconds is exceeded)",
I'm good with this workaround in Kueue, but wondering if we should additionally ticket this in Ray (I would expect jobStatus=Failed in this case).
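To illustrate the case under discussion, here is a minimal Go sketch of how Finished() could derive its results once JobDeploymentStatus is taken into account. The types and constants below are simplified stand-ins defined for this example, not the KubeRay API types, and the exact logic in the PR may differ. The sketch assumes "finished" is true when the deployment status is Complete or Failed, and "success" is true only when the job status is Succeeded.

```go
package main

import "fmt"

// Hypothetical stand-ins for the RayJob status fields; the string values
// are illustrative and are not taken from the KubeRay API.
type JobStatus string
type JobDeploymentStatus string

const (
	JobStatusRunning   JobStatus = "RUNNING"
	JobStatusSucceeded JobStatus = "SUCCEEDED"
	JobStatusFailed    JobStatus = "FAILED"

	JobDeploymentStatusRunning  JobDeploymentStatus = "Running"
	JobDeploymentStatusComplete JobDeploymentStatus = "Complete"
	JobDeploymentStatusFailed   JobDeploymentStatus = "Failed"
)

// RayJobStatus is a simplified stand-in for the status of a RayJob.
type RayJobStatus struct {
	JobStatus           JobStatus
	JobDeploymentStatus JobDeploymentStatus
	Message             string
}

// Finished reports whether the job is done and whether it succeeded.
// Assumption: "done" follows the deployment status (Complete or Failed),
// while "success" still follows the status of the Ray job itself.
func Finished(s RayJobStatus) (message string, success, done bool) {
	success = s.JobStatus == JobStatusSucceeded
	done = s.JobDeploymentStatus == JobDeploymentStatusComplete ||
		s.JobDeploymentStatus == JobDeploymentStatusFailed
	return s.Message, success, done
}

func main() {
	// The activeDeadlineSeconds case: the Ray job still reports Running,
	// but the deployment has already been marked Failed.
	s := RayJobStatus{
		JobStatus:           JobStatusRunning,
		JobDeploymentStatus: JobDeploymentStatusFailed,
		Message:             "deadline exceeded (illustrative message)",
	}
	msg, success, done := Finished(s)
	fmt.Printf("message=%q success=%v finished=%v\n", msg, success, done)
}
```

With a check based only on JobStatus, this job would never be reported as finished, because JobStatus stays at Running. Keying "finished" off the deployment status avoids that.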
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andrewsykim, mimowo
Already merged, that's fine :)
@mimowo is there a v0.8.2 release planned? If so I'd like to consider this for backport
We haven't yet planned it, but we estimated 0.9 to be in around 3-4 weeks. /cherry-pick release-0.8
@mimowo: new pull request created: #3128
@mimowo if this is the only bug fix then I think it's fine to wait for v0.9.0. I don't consider this a critical bug fix. However, if there were other fixes that warrant a new patch release, then please include this one. Thank you! :)
I guess that the regression only happens when users use the latest KubeRay version, right?
No it's specifically if you set |
Oh, I was confused with the backoffLimit field. You're right.
SGTM. Ideally, we want to cut a patch release every month and a minor release every 4 months.
What type of PR is this?
/kind bug
What this PR does / why we need it:
RayJob status has both a JobStatus field and a JobDeploymentStatus field. The former represents the status of the running Ray job and the latter represents the state of the RayCluster and RayJob. The Finished() implementation in Kueue for RayJob should look at the JobDeploymentStatus for "finished", since that is what represents whether resources for the job have been cleaned up. This will also catch some cases where JobStatus will remain "Running" even though JobDeploymentStatus is "Failed". This can specifically happen when activeDeadlineSeconds is exceeded.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?