Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RayJob: update Finished() to account for JobDeploymentStatus #3120

Merged
merged 1 commit into from
Sep 24, 2024

Conversation

andrewsykim
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

RayJob status has both a JobStatus field and a JobDeploymentStatus field. The former represents the status of the running Ray job and the latter represents the state of the RayCluster and RayJob. The Finished() implementation in Kueue for RayJob should look at the JobDeployment for "finished" since that is what represents whether resources for the job have been cleaned up. This will also catch some cases where JobStatus will remain "Running" even though JobDeploymentStatus is "Failed". This can specifically happen when activeDeadlineSeconds is exceeded.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

RayJob's implementation of Finished() now inspects at JobDeploymentStatus 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Sep 23, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 23, 2024
Copy link

netlify bot commented Sep 23, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 665c112
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66f1c4105a3e840008796075

@andrewsykim
Copy link
Member Author

@tenzen-y @kevin85421 @astefanutti PTAL

@andrewsykim
Copy link
Member Author

/retest

@astefanutti
Copy link
Member

That makes sense, thanks.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 1914b02f5adbf515a0575840544daa6bfd49ab6f

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
LGTM (not tagging yet just to wait for addressing the nits)

Comment on lines +333 to +335
t.Logf("actual success: %v", success)
t.Logf("expected success: %v", testcase.expectedSuccess)
t.Error("unexpected result for 'success'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Logf("actual success: %v", success)
t.Logf("expected success: %v", testcase.expectedSuccess)
t.Error("unexpected result for 'success'")
t.Errorf("unexpected result for 'success'; want=%v, got=%v", testcase.expectedSuccess, success)

nit

Comment on lines +339 to +341
t.Logf("actual finished: %v", finished)
t.Logf("expected finished: %v", testcase.expectedFinished)
t.Error("unexpected result for 'finished'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Logf("actual finished: %v", finished)
t.Logf("expected finished: %v", testcase.expectedFinished)
t.Error("unexpected result for 'finished'")
t.Errorf("unexpected result for 'finished'; want=%v, got=%v", testcase.expectedFinished, finished)

nit

expectedFinished: true,
},
{
name: "jobStatus=Running, jobDeploymentStatus=Failed (when activeDeadlineSeconds is exceeded)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this workaround in Kueue, but wondering if we should additionally ticket this in Ray (I would expect jobStatus=Failed in this case).

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2024
@k8s-ci-robot k8s-ci-robot merged commit f275156 into kubernetes-sigs:main Sep 24, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Sep 24, 2024
@mimowo
Copy link
Contributor

mimowo commented Sep 24, 2024

Already merged, that's fine :)

@andrewsykim
Copy link
Member Author

@mimowo is there a v0.8.2 release planned? If so I'd like to consider this for backport

@mimowo
Copy link
Contributor

mimowo commented Sep 24, 2024

We haven't yet planned it, but we estimated 0.9 to be in around 3-4 weeks.
Given that we have an important bug fixed, and the timeline for 0.9 is rather distant I'm in favor of preparing 0.8.2.

/cc @tenzen-y @alculquicondor

/cherry-pick release-0.8

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #3128

In response to this:

We haven't yet planned it, but we estimated 0.9 to be in around 3-4 weeks.
Given that we have an important bug fixed, and the timeline for 0.9 is rather distant I'm in favor of preparing 0.8.2.

/cc @tenzen-y @alculquicondor

/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andrewsykim
Copy link
Member Author

@mimowo if this is the only bug fix then I think it's fine to wait for v0.9.0. I don't consider this a critical bug fix.

However, if there were other fixes that warrant a new patch release, then please include this one. Thank you! :)

@mimowo
Copy link
Contributor

mimowo commented Sep 24, 2024

We also have this one, #3093
and I think we should consider including #3118 but it is not reviewed yet)
When the leader election is enabled, set true in leaderElectionReleaseOnCancel (#3096)

@tenzen-y
Copy link
Member

This will also catch some cases where JobStatus will remain "Running" even though JobDeploymentStatus is "Failed". This can specifically happen when activeDeadlineSeconds is exceeded.

I guess that the regression situation happen only when users use the latest KubeRay version, right?

@andrewsykim
Copy link
Member Author

I guess that the regression situation happen only when users use the latest KubeRay version, right?

No it's specifically if you set activeDeadlineSeconds and a RayJob expires. In this case JobStatus=Running and JobDeploymentStatus=Failed and the workload is never removed from ClusterQueue

@tenzen-y
Copy link
Member

activeDeadlineSeconds

Oh, I was confused with the backoffLimit field. You're right.

@tenzen-y
Copy link
Member

We haven't yet planned it, but we estimated 0.9 to be in around 3-4 weeks.
Given that we have an important bug fixed, and the timeline for 0.9 is rather distant I'm in favor of preparing 0.8.2.

SGTM, ideally, we want to cut a patch release every month and want to cut a minor release every 4 months.

kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants