Flaky Multikueue E2E tests #1658
I think this is a duplicate of #1649.
/kind flaky
@tenzen-y: The label(s) `kind/flaky` cannot be applied. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind flake
For this behavior, I would expect the kueue controller manager to have crashed, since before starting the suite it is checked that all the clusters are able to create a ResourceFlavor. So, very likely this is happening due to the heavy load in the MultiKueue case.
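For reference, a minimal sketch of roughly what that startup check looks like (assuming the same test-util wrappers and constants that appear in the snippet further down this thread); only the ResourceFlavor webhook is exercised:

```go
// Rough sketch of the pre-existing readiness check: the suite only waits until a
// ResourceFlavor can be created, i.e. only the ResourceFlavor webhook is probed.
func KueueReadyForTesting(ctx context.Context, c client.Client) {
	rf := utiltesting.MakeResourceFlavor("default").Obj()
	gomega.Eventually(func() error {
		return c.Create(ctx, rf)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())
	ExpectResourceFlavorToBeDeleted(ctx, c, rf, true)
}
```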
As discussed under #1659, I ran experiments looping the e2e MultiKueue tests. First, I was able to repro the issue locally with a failure rate around 1 out of 4, which is close to the one on GH CI. Second, when running on existing clusters I don't get any failures (37 passes in a row, interrupted manually), which suggests the issue only occurs during startup. Third, with the following code change I eliminated the failures locally (30 passes in a row, still running):

```go
func KueueReadyForTesting(ctx context.Context, client client.Client) {
	// Wait until the ResourceFlavor webhook is able to admit a create request.
	resourceKueue := utiltesting.MakeResourceFlavor("default").Obj()
	gomega.Eventually(func() error {
		return client.Create(ctx, resourceKueue)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())

	// Also wait until the ClusterQueue webhook is functional.
	cqKueueTest := utiltesting.MakeClusterQueue("q1").
		ResourceGroup(
			*utiltesting.MakeFlavorQuotas("default").
				Resource(corev1.ResourceCPU, "1").
				Obj(),
		).
		Obj()
	gomega.Eventually(func() error {
		return client.Create(ctx, cqKueueTest)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())

	// Clean up the probe objects so the suite starts from a clean state.
	ExpectClusterQueueToBeDeleted(ctx, client, cqKueueTest, true)
	ExpectResourceFlavorToBeDeleted(ctx, client, resourceKueue, true)
}
```

This also suggests the issue occurs only on startup. Further, it suggests that for MultiKueue, where the system is loaded, there might be a significant difference in when the ResourceFlavor webhook and the ClusterQueue webhook become functional. This also appears to explain why PR #1659 is stable. IIUC there is another ongoing effort by @trasc to see if we can have a more generic solution: #1674.
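For completeness, a hedged sketch of how such a check could be wired into the MultiKueue suite setup. The client variable names and the util package qualifier below are placeholders, not necessarily what the suite actually uses:

```go
// Sketch only: wait for the manager and both worker clusters to have functional
// webhooks before any spec runs. managerClusterClient, worker1ClusterClient and
// worker2ClusterClient stand in for whatever clients the suite already constructs.
var _ = ginkgo.BeforeSuite(func() {
	ctx := context.Background()
	for _, c := range []client.Client{managerClusterClient, worker1ClusterClient, worker2ClusterClient} {
		util.KueueReadyForTesting(ctx, c)
	}
})
```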
Ok, I got a failure on the 32nd loop, but on creating a LocalQueue, because the localqueue_webhook wasn't ready. This reinforces the statement that the webhooks become ready at different points in time. However, this also means that we would need to add creating LocalQueues to KueueReadyForTesting as well.
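A minimal sketch of what that addition might look like, following the pattern of the snippet above; the utiltesting.MakeLocalQueue wrapper and the "default" namespace are assumptions:

```go
// Sketch: also probe the LocalQueue webhook before the suite starts. Assumes a
// namespace named "default" exists and a MakeLocalQueue wrapper is available,
// analogous to the other wrappers used above.
lqKueueTest := utiltesting.MakeLocalQueue("q1", "default").ClusterQueue("q1").Obj()
gomega.Eventually(func() error {
	return client.Create(ctx, lqKueueTest)
}, StartUpTimeout, Interval).Should(gomega.Succeed())
gomega.Expect(client.Delete(ctx, lqKueueTest)).To(gomega.Succeed())
```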
What was the error?
I expect
Not sure I understand enough to follow up. Do you suggest that this is not caused by webhooks, or that there is a bug in the API server such that a 500 is returned in this case?
I wonder if the ... That might be consistent with @mimowo's observations. However, when we added the MK tests, we didn't increase the resource requests in the E2E jobs, did we? Perhaps we can start there?
I think we should do both: #1674 and increasing the requests.
I think I understand now what was happening. Described here: #1659 (comment). Essentially, with 2 replicas running, the registered webhooks are distributed randomly between the two replicas. With the
I have also opened an alternative proposal using probes to wait for the ready replicas: #1676. It seems to pass consistently, but I'm going to test it more.
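Not the actual #1676 implementation, but a hedged sketch of the probe-based idea: block until the kueue-controller-manager Deployment reports all replicas ready, so that every replica's readiness probe (which covers the webhook server) has passed before the suite starts. The namespace, deployment name, and timeout values are assumptions based on a default installation:

```go
package util

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder timeouts; the suite's own StartUpTimeout/Interval could be used instead.
const (
	startUpTimeout = 5 * time.Minute
	pollInterval   = 250 * time.Millisecond
)

// WaitForKueueAvailability is a sketch only: it blocks until every replica of the
// kueue-controller-manager Deployment reports ready, which implies the readiness
// probe (and therefore the webhook server) has passed on each replica.
func WaitForKueueAvailability(ctx context.Context, c client.Client) {
	key := types.NamespacedName{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	gomega.Eventually(func(g gomega.Gomega) {
		deploy := &appsv1.Deployment{}
		g.Expect(c.Get(ctx, key, deploy)).To(gomega.Succeed())
		// Assumes spec.replicas is set, as it is in the default manifests.
		g.Expect(deploy.Spec.Replicas).NotTo(gomega.BeNil())
		g.Expect(deploy.Status.ReadyReplicas).To(gomega.Equal(*deploy.Spec.Replicas))
	}, startUpTimeout, pollInterval).Should(gomega.Succeed())
}
```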
What happened:
It looks like the Kueue managers start properly, but somehow they crash later.
As a result, we observe failures in:
End To End MultiKueue Suite: kindest/node:v1.28.0: [It] MultiKueue when Creating a multikueue admission check Should run a job on worker if admitted
What you expected to happen:
Kueue managers to continue to run properly
How to reproduce it (as minimally and precisely as possible):
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1607/pull-kueue-test-e2e-main-1-28/1750942764473782272
Note that this PR only changes documentation, so the flakiness is definitely in the MultiKueue code.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Kueue version (use git describe --tags --dirty --always):
- OS (e.g.: cat /etc/os-release):
- Kernel (e.g. uname -a):