Revisit optimizing the integration test runtime #2993
Comments
FYI the previous effort (for x-ref): #1737 |
/assign |
Actually, I proposed refactoring the MultiKueue Kubeflow E2E and integration testing. |
@tenzen-y what is the aim of the refactoring, can you point to the discussion? How will this impact the integration test performance? Also, before we start optimizing the status quo, I would really appreciate an investigation into why the time increased from 9min to 16min since #1737 (comment), which was just 6 months ago. It could be just due to new tests, but maybe something else is responsible (like slower machines or a performance regression). EDIT: what I mean is that we can first check which new integration tests were added since then, and whether they can really account for the additional 7min - I'm surprised by the increase within just 6 months. Maybe some of the new tests are not optimized. |
For starters I propose #3035 which adds a way to report the time taken by individual tests. |
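For illustration, a minimal sketch (assuming Ginkgo v2) of one way to surface per-spec timings from a suite-level report hook; this is just an example of the idea, not necessarily how #3035 implements it:

```go
package sample_test

import (
	"fmt"
	"sort"
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestSample(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "Sample Suite")
}

// ReportAfterSuite runs once, after all parallel processes have finished,
// and receives the aggregated report, so it can print the slowest specs.
var _ = ginkgo.ReportAfterSuite("print spec timings", func(report ginkgo.Report) {
	specs := report.SpecReports
	sort.Slice(specs, func(i, j int) bool { return specs[i].RunTime > specs[j].RunTime })
	for i, s := range specs {
		if i == 10 { // only the 10 slowest specs
			break
		}
		fmt.Printf("%v\t%s\n", s.RunTime, s.FullText())
	}
})
```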
When it comes to overall time consumption the trend was fairly steady; the biggest bump (of around 2min) I see is around 05 Jul, when race detection was added.
$ git log 35586d7539bff45e071d39f7e85ebc87e4245c97..cd89852f2c4d921e2ec51917152f8fdea80eb87d
commit cd89852f2c4d921e2ec51917152f8fdea80eb87d
Author: Irving Mondragón <[email protected]>
Date: Thu Jul 4 22:51:06 2024 +0200
Remove deprecated Hugo template (#2506)
commit 6c619c6becea43415bb189067c5e94e8dcda355f
Author: Mykhailo Bobrovskyi <[email protected]>
Date: Thu Jul 4 20:48:06 2024 +0300
Runs the race detector on integration tests. (#2468)
|
I think the race detection has already proved to be useful, so the trade-off is not clear to me. I would keep it for now, but just keep in mind that we could gain 2min by changing it. |
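For context on what the detector buys us, here is a hypothetical Go test (not from the Kueue codebase) with an unsynchronized write/read pair; a plain `go test` run usually passes, while `go test -race` typically reports it:

```go
package racedemo_test

import (
	"sync"
	"testing"
)

// A hypothetical test where an "assertion" goroutine reads state that a
// "controller" goroutine is still mutating, without any synchronization.
func TestUnsynchronizedRead(t *testing.T) {
	counter := 0
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			counter++ // concurrent write
		}
	}()
	go func() {
		defer wg.Done()
		_ = counter // unsynchronized concurrent read: flagged by `go test -race`
	}()
	wg.Wait()
}
```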
@mimowo Actually, I meant the following discussion: |
The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e. |
Oh, I was supposed to mention integration testing as well. |
I see, so your suggestion is to only keep integration tests for |
That can be easily done, but I don't expect a huge gain out of it: each of the tests takes under 3 sec to execute. With #3039 I tried to parallelize the cluster creation for MultiKueue; in theory that can reduce the time by around 40s, but making the setup thread-safe is more challenging than expected. Another thing we can try is to reuse the envtest clusters and the setup-manager / stop-manager operations, but this may not make too much difference with parallel running and may surface new issues due to incomplete cleanups. |
Adds the ability to reuse the envtest instance and replace the manager. One follow-up for this could be to lazy-start the envtest instances; this will probably have a bigger code impact, and its benefits will be visible only in suites that have fewer top-level specs than the parallelism. |
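Roughly, the pattern described above, as a hedged sketch against plain controller-runtime/envtest (not the actual Kueue test framework helpers): the envtest control plane is started once and reused, and only the manager is stopped and replaced when a spec needs a different configuration.

```go
package suite_test

import (
	"context"

	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

var (
	testEnv   *envtest.Environment
	cfg       *rest.Config
	cancelMgr context.CancelFunc
)

// startEnv boots the shared kube-apiserver/etcd once; it is reused across manager restarts.
func startEnv() error {
	testEnv = &envtest.Environment{}
	var err error
	cfg, err = testEnv.Start()
	return err
}

// runManager replaces the currently running manager with one built by setup,
// keeping the envtest control plane (and the objects in it) untouched.
// The sketch omits waiting for the previous manager's graceful shutdown, and
// in practice the metrics listener should be disabled so that restarted
// managers do not compete for the same port.
func runManager(setup func(manager.Manager) error) error {
	if cancelMgr != nil {
		cancelMgr() // stop only the previous manager; the cluster keeps running
	}
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		return err
	}
	if err := setup(mgr); err != nil {
		return err
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancelMgr = cancel
	go func() { _ = mgr.Start(ctx) }() // Start blocks until the context is cancelled
	return nil
}
```

A lazy-start variant would defer the startEnv call (a helper name assumed here for the sketch) until the first spec that actually needs the cluster, which is where the bigger code impact mentioned above would come from.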
That may be a little bit different. I'm wondering if we can add all Kubeflow MultiKueue cases only for PyTorchJob and add only basic creation cases for all other Kubeflow Jobs. That is similar to the Kubeflow Jobs integration testing (not MultiKueue). |
It is done in #3085 |
With #3085 merged
With this, the testing time will drop from 18min to around 12min for PRs, and to around 14-15min for periodic builds. |
Wonderful, thank you! |
Great to see! Thanks! |
Ideally it would be nice to stay around 10min, but maybe keeping that goal is not feasible as the project grows, so I think we can close. Unless @trasc you have some more ideas you want to follow up with? |
The recent #3176 inspires me to ask if we could revisit reducing the podsReadyTimeout used in a couple of tests, like here or here, to use the TinyTimeout rather than ShortTimeout. Can you check that, @mbobrovskyi or @trasc? |
Unfortunately, no. I already optimized this in #2329, and this is the minimum we can set. |
Ack |
In some suites we don't need to restart the manager with a different configuration. |
So, IIUC, for these suites it does not matter from a performance PoV, because the manager is started only once anyway. Still, it might be worth following up to make it consistent, as people often copy-paste the existing tests, and if they copy the ones using RunManager we may not have optimal performance in the future. |
|
One thing I wanted to check was the impact of parallelism and race detection. So besides maybe dropping the race detection, I guess we can close this issue. |
Thanks for the summary. Regarding NProcs 2 vs. 4, the performance differences aren't big and are not very consistent. However, I'd prefer to keep it at 4, since higher parallelism might be helpful when we have more tests in the future, and may help us expose flakes (similar effect as --race). Regarding the race detection, I think what could make sense is to drop this flag for presubmits and only use it for periodic tests. Ideally, local runs would also enable the flag by default. WDYT @tenzen-y @alculquicondor @trasc ? |
I also think it would make sense to have a generic env var like PRESUBMIT, and based on it control the optimizations like INTEGRATION_RACE=false or INTEGRATION_RUN_ALL=false (or directly set the INTEGRATION_TEST_FILTERS). Then, if we change our decisions about the presubmit config, we don't need to update two places. WDYT? |
Even though the race detection might sometimes be flaky for some parts of the code, maybe due to some unfortunate test ordering, I still think it's better to continue doing the check in presubmit. |
Before we enabled the race detection, there were many race issues in our integration testing: #2468. So, I suspect that the periodic job would often fail with race issues once we disable it in the presubmit. |
I would not do that. When a race makes it to the main branch, it becomes our problem to solve. The contributor that introduced the change, unfortunately, might not be available. |
Ok, thank you for the input, we have a consensus here. /close |
@mimowo: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
We have had an effort to improve the integration test performance in the past, and we brought the time down below 10min.
However, recently the integration tests suite takes over 16min based on https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-integration-main. Specific example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-integration-main/1831572082656284672
Part of the effort would be to figure out whether the slowdown can be attributed to more tests, or whether there is another reason.
The build time is particularly important during the release process, which takes a couple of builds.