Pod groups: e2e tests for diverse pods and preemption #1638
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Ok, I see why a re-admitted workload is deleted after completion. I find it inconsistent, because the workload is not deleted if it finishes in the first run. Here is the scenario:
So, this feels inconsistent, because the workload would stay if it succeeded in the first run. It took me a while to understand the interactions, but maybe it is acceptable, because the workload is finished anyway, and workloads aren't really user-facing APIs. WDYT @alculquicondor @tenzen-y? I guess in the e2e I can just assume it is deleted.
As discussed in #1557, this is a bug.
Force-pushed from 4722065 to a1976ad
Thank you for the clarifications! As Aldo mentioned in #1557, I also think this is a bug.
/retest
Force-pushed from a1976ad to 0899f55
/assign @tenzen-y @alculquicondor
/approve
/hold
for test retries.
// For replacement pods use args that let it complete fast.
rep.Name = "replacement-for-" + rep.Name
rep.Spec.Containers[0].Args = []string{"1ms"}
gomega.Expect(k8sClient.Create(ctx, rep)).To(gomega.Succeed())
I wonder if there is potential for flakiness here, as Kueue might not have observed the Pod as Failed yet.
I'm not sure whether events within a kind cluster are ordered. If they are, Kueue would only see the replacement Pod after it has seen the other Pod as failed, in which case there wouldn't be flakiness.
Let's run the tests a few times.
Otherwise, we might have to implement the logic in which, instead of deleting excess Pods, they are left gated until there is space.
I assume there is no issue with flakiness, because I ran this in a loop locally for over an hour and all attempts passed. Also, all attempts on the GitHub CI passed (6).
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-27
LGTM!
/lgtm
/approve
/hold cancel
LGTM label has been added. Git tree hash: b4f0ccecf7923d07707dca0762e4a55813a2d4f6
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mimowo, tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
…s#1638)
* WIP: Add more e2e tests for pod groups
* cleanup
* review comment
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
To verify the feature works as expected and prevent regressions.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
In order to verify the tests aren't flaky, I let them run in a loop for 1h (until timeout), and there were no errors.
Does this PR introduce a user-facing change?