Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PodsReady condition if WaitForPodsReady is enabled #460

Merged
merged 5 commits into from
Dec 13, 2022

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Dec 5, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is the first part of implementing the solution designed in: #433.
Admission of queued workloads will be blocked until the currently starting workload has the PodsReady condition (when WaitForPodsReady is configured at the Kueue level).

Which issue(s) this PR fixes:

Part of: #349

Special notes for your reviewer:

  • the current description of the WaitForPodsReady field is not fully implemented - there is going to be a follow up
    PR to implement blocking of admission

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 5, 2022
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 5, 2022
@mimowo mimowo changed the title Add PodsReady condition if WaitForPodsReady is enabled WIP: Add PodsReady condition if WaitForPodsReady is enabled Dec 5, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 5, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 5, 2022
@mimowo mimowo force-pushed the pods-ready-condition branch 6 times, most recently from 2f78a66 to 195d055 Compare December 6, 2022 12:23
@mimowo mimowo changed the title WIP: Add PodsReady condition if WaitForPodsReady is enabled Add PodsReady condition if WaitForPodsReady is enabled Dec 6, 2022
@mimowo mimowo marked this pull request as ready for review December 6, 2022 12:40
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2022
@mimowo mimowo force-pushed the pods-ready-condition branch 5 times, most recently from f2f11c4 to d09fce4 Compare December 6, 2022 13:14
Makefile Outdated Show resolved Hide resolved
apis/kueue/v1alpha2/workload_types.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
apis/config/v1alpha2/configuration_types.go Outdated Show resolved Hide resolved
pkg/controller/workload/job/job_controller.go Outdated Show resolved Hide resolved

for name, tc := range testcases {
t.Run(name, func(t *testing.T) {
got := jobPodsReady(tc.job)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have all the coverage in integration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not everything, for example in integration tests we only test parallelism=completions. Further, the integration tests are much slower, so I would rather decrease the number of scenarios in integration tests than deleting unit tests.

test/integration/controller/job/job_controller_test.go Outdated Show resolved Hide resolved
test/integration/controller/job/job_controller_test.go Outdated Show resolved Hide resolved
pkg/controller/workload/job/job_controller.go Outdated Show resolved Hide resolved
@mimowo mimowo force-pushed the pods-ready-condition branch 3 times, most recently from 1854f1d to c18da68 Compare December 8, 2022 14:29
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 12, 2022
@alculquicondor
Copy link
Contributor

/lgtm
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 12, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 12, 2022
@ahg-g
Copy link
Contributor

ahg-g commented Dec 12, 2022

/hold

I would like to take a look

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 12, 2022
@@ -45,6 +45,18 @@ type Configuration struct {

// InternalCertManagement is configuration for internalCertManagement
InternalCertManagement *InternalCertManagement `json:"internalCertManagement,omitempty"`

// WaitForPodsReady is configuration for waitForPodsReady
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not useful docs :)

Copy link
Contributor Author

@mimowo mimowo Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended

apis/config/v1alpha2/configuration_types.go Outdated Show resolved Hide resolved
apis/config/v1alpha2/configuration_types.go Show resolved Hide resolved
// handle a job when waitForPodsReady is enabled
if r.waitForPodsReady {
log.V(5).Info("Handling a job when waitForPodsReady is enabled")
condition := generatePodsReadyCondition(&job)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to update the condition when workload.admitted is set to nil? Or we want to wait until the job is actually suspended and we get the job update event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a practical scenario when it matters. Maybe if the request to set admission=nil, but the request to set the suspend=true fails (and is not executed as the job gets quickly re-admitted), but then it seems to make sense to react to the change of the suspend field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only scenario where it could matter is when the workload is immediately admitted again and we didn't have time to update the PodsReady condition to false.

But that could still happen in any case, because we can't update the status and spec at the same time. Unless we can in a mutating webhook?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't reset the condition in a mutating webhook either. To be on the safer side, maybe we can set the condition to false as soon as we see admission=nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code now to check for workload admission to be on the safer side.

@@ -265,6 +268,217 @@ var _ = ginkgo.Describe("Job controller for workloads with no queue set", func()
})
})

var _ = ginkgo.Describe("Job controller when waitForPodsReady enabled", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test to verify that the condition gets updated for a workload that initially has spec.Admission set and then unset (to simulate preemption).

Copy link
Contributor Author

@mimowo mimowo Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the tests, but I reuse the previously added test cases (with suspended=true), but in these cases the job was never unsuspended. Now I unconditionally admit the workload at the beginning, but if a test specifies "suspended=true" then Admission is unset, the job is suspended, and then we verify the condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg, we can have a test in the scheduler once preemption is implemented.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2022
@mimowo mimowo force-pushed the pods-ready-condition branch 2 times, most recently from bd4a29a to 2b19996 Compare December 13, 2022 08:17
@mimowo mimowo force-pushed the pods-ready-condition branch 2 times, most recently from 543d096 to 843b00d Compare December 13, 2022 09:57
return InConditionWithStatus(w, condition, metav1.ConditionTrue)
}

func InConditionWithStatus(w *kueue.Workload, condition string, status metav1.ConditionStatus) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add this function.

As a follow up: remove InCondition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up unnecessary changes, will create the follow up PR to remove the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor
Copy link
Contributor

/lgtm

Leaving the hold to @ahg-g

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2022
@ahg-g
Copy link
Contributor

ahg-g commented Dec 13, 2022

/hold cancel
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 13, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ahg-g,alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0c18192 into kubernetes-sigs:main Dec 13, 2022
@mimowo mimowo deleted the pods-ready-condition branch March 18, 2023 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants