
Pod groups: e2e tests for diverse pods and preemption #1638

Merged
merged 3 commits into kubernetes-sigs:main from add-e2e-tests-pod-groups on Jan 29, 2024

Conversation

mimowo
Contributor

@mimowo mimowo commented Jan 24, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

To verify the feature works as expected and prevent regressions.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

To verify the tests aren't flaky, I ran them in a loop for 1h (until timeout) and saw no errors.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, release-note-none, kind/cleanup, and cncf-cla: yes labels Jan 24, 2024

netlify bot commented Jan 24, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 04dd696
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65b766375cc3250008aefb57

@k8s-ci-robot k8s-ci-robot added the size/L label Jan 24, 2024
@mimowo
Contributor Author

mimowo commented Jan 25, 2024

OK, I see why a re-admitted workload is deleted after completion. I find it inconsistent, because it does not get deleted if it finishes in the first run. Here is the scenario:

  1. we create the low-priority group
  2. then the high-priority group, which preempts the low-priority group
  3. the low-priority group pods are Terminating, because they have finalizers; the pods [p1, p2] are owners of the workload "w"
  4. after readmission, replacement pods [repl-p1, repl-p2] are created for the workload "w", but they don't become owners of the workload
  5. the workload completes, and Finalize is called. This removes finalizers from all pods. Then (p1, p2), which were the only owners of the workload, get deleted, so the workload gets deleted by the garbage collector

So this feels inconsistent, because the workload would stay if it succeeded in the first run. It took me a while to understand the interactions, but maybe it is acceptable, because the workload is finished anyway, and workloads aren't really user-facing APIs. WDYT @alculquicondor @tenzen-y? I guess in the e2e I can just assume it is deleted.
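
A minimal sketch (not the actual test code from this PR) of the e2e assertion hinted at above: poll until the group's Workload has been garbage-collected once its only owner pods (p1, p2) are gone. It assumes the usual controller-runtime/Gomega e2e harness (ctx, k8sClient, ns); the workload name, timeout, and interval are placeholders.

// Assumed imports: time, gomega, apierrors "k8s.io/apimachinery/pkg/api/errors",
// client "sigs.k8s.io/controller-runtime/pkg/client", kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1".
gomega.Eventually(func(g gomega.Gomega) {
    wl := &kueue.Workload{}
    err := k8sClient.Get(ctx, client.ObjectKey{Namespace: ns.Name, Name: "w"}, wl)
    // Once the original owner pods are deleted, the Workload should be gone too.
    g.Expect(apierrors.IsNotFound(err)).To(gomega.BeTrue(), "workload should be garbage-collected")
}, time.Minute, time.Second).Should(gomega.Succeed())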

@alculquicondor
Copy link
Contributor

As discussed in #1557, this is a bug.

@mimowo mimowo force-pushed the add-e2e-tests-pod-groups branch from 4722065 to a1976ad Compare January 25, 2024 16:05
@tenzen-y
Copy link
Member

Thank you for the clarifications! As Aldo mentioned in #1557, I also think this is a bug.

@mimowo
Contributor Author

mimowo commented Jan 25, 2024

/retest
unrelated failure before tests started: net/http: TLS handshake timeout

@mimowo mimowo force-pushed the add-e2e-tests-pod-groups branch from a1976ad to 0899f55 Compare January 25, 2024 16:26
@mimowo mimowo changed the title WIP: Add more e2e tests for pod groups Pod groups: e2e tests for diverse pods and preemption Jan 26, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 26, 2024
@mimowo
Contributor Author

mimowo commented Jan 26, 2024

/assign @tenzen-y @alculquicondor

Contributor

@alculquicondor alculquicondor left a comment


/approve
/hold
for test retries.

test/e2e/singlecluster/pod_test.go
// For replacement pods, use args that let them complete fast.
rep.Name = "replacement-for-" + rep.Name
rep.Spec.Containers[0].Args = []string{"1ms"}
gomega.Expect(k8sClient.Create(ctx, rep)).To(gomega.Succeed())
Contributor

I wonder if there is potential for flakiness here, as Kueue might not have observed the Pod as Failed yet.

I'm not sure if events within a kind are ordered. In that case, Kueue might only see the replacement Pod after it has seen the other Pod as failed, in which case there wouldn't be flakiness.

Let's run the tests a few times.

Otherwise, we might have to implement the logic in which, instead of deleting excess Pods, they are left gated until there is space.

Contributor Author

I assume there is no issue with flakiness, because I ran this in a loop locally for >1h and all attempts passed. Also, all 6 attempts on GH CI passed.
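
If flakiness did show up here, one way to narrow the race described above would be to wait until the API server reports the preempted pod as Failed (or at least marked for deletion) before creating its replacement, giving Kueue a chance to observe the failure first. This is a hedged sketch, not code from this PR: "origPod" and the polling values are assumptions, and Kueue's informer could still lag, so this narrows rather than removes the race.

// Assumed imports: time, gomega, corev1 "k8s.io/api/core/v1",
// apierrors "k8s.io/apimachinery/pkg/api/errors", client "sigs.k8s.io/controller-runtime/pkg/client".
gomega.Eventually(func(g gomega.Gomega) {
    var p corev1.Pod
    err := k8sClient.Get(ctx, client.ObjectKeyFromObject(origPod), &p)
    if apierrors.IsNotFound(err) {
        return // pod already fully deleted
    }
    g.Expect(err).NotTo(gomega.HaveOccurred())
    g.Expect(p.Status.Phase == corev1.PodFailed || p.DeletionTimestamp != nil).To(gomega.BeTrue())
}, time.Minute, time.Second).Should(gomega.Succeed())
// Then create the replacement exactly as in the snippet above.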

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold and approved labels Jan 26, 2024
@alculquicondor
Contributor

/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor Author

mimowo commented Jan 29, 2024

/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor Author

mimowo commented Jan 29, 2024

/assign @tenzen-y

Thank you for the clarifications! As Aldo mentioned in #1557, I also think this is a bug.

I'll leave this as a follow-up fix with a TODO comment. Once it is fixed, we can just add a new verification step.

Member

@tenzen-y tenzen-y left a comment


LGTM!

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Jan 29, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 29, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b4f0ccecf7923d07707dca0762e4a55813a2d4f6

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

/retest
due to #1658

@k8s-ci-robot k8s-ci-robot merged commit cb3c2f6 into kubernetes-sigs:main Jan 29, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Jan 29, 2024
@mimowo mimowo deleted the add-e2e-tests-pod-groups branch February 10, 2024 11:47
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…s#1638)

* WIP: Add more e2e tests for pod groups

* cleanup

* review comment
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • release-note-none - Denotes a PR that doesn't merit a release note.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants