Migrate prowjobs to new infrastructure #8689
Comments
/triage accepted Definitely think we should see how this works. We currently have almost duplicate runs of e2e-full - one with IPv6 enabled and one without - introduced here: kubernetes/test-infra#29519 The IPv6 e2e seems to perform about as well as the normal e2e run with a couple of additional flakes from the dualstack tests (which only exist in the IPv6 variant). WDYT about moving one of those to the AWS cluster to see how it works? This would give us good coverage across a lot of tests on AWS, but wouldn't reduce any of our coverage from what we had ~last week. |
WDYT about moving all 1.2 tests over for now? We can test if everything works and don't introduce a potential additional factor in the tests we care about. If that looks good, we can continue to move more. |
That makes sense +1 to moving over all the 1.2 tests. |
+1 for starting from 1.2; this gives us time to experiment without affecting the release calendar. |
Looks like it's not just about moving the jobs. There are also additional properties getting enforced around resources (CPU limits and memory limits + requests), e.g.:
|
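For illustration, a hypothetical ProwJob fragment showing the kind of resource fields being enforced might look like this; the image tag and the resource values are placeholders, not our actual settings:

```yaml
# Hypothetical ProwJob container spec. The image tag and resource values
# are illustrative placeholders, not measured requirements for our jobs.
spec:
  containers:
  - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master
    resources:
      requests:
        cpu: "4"       # requests == limits for CPU, as suggested below
        memory: "9Gi"  # CAPZ's value, mentioned below, only a starting point
      limits:
        cpu: "4"
        memory: "9Gi"
```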
Hm, for CPU we can use requests=limits. This should solve a few cases, maybe all (+/- looking at similar jobs). I have no idea how much memory we need. Maybe we should ask upstream how we can find out the usage of our current jobs, or which values we should start with. Too-low memory values are definitely not fun, as we have to deal with random errors because of OOM kills :/ |
AFAIK our current CPU request is For memory it looks like CAPZ is setting it at 9Gi right now. Not sure how comparable their jobs are, but maybe it's a good starting point? |
CAPZ is not running the workload clusters in the ProwJob pod |
Notes from talking to @ameukam: There is no good method to determine how much we need or a baseline, so we kind of need to test and iterate. |
We should probably start with low values (2 GB?) for requests and limits and iterate if the jobs are OOM-killed. |
This will require quite a lot of trial and error. We did this in the past on our own Prow instance, and it's sometimes hard to figure out that your job failed because of an OOM kill. (Also, memory usage of a job is usually not constant over longer periods of time.) |
Do we have to set memory on |
Ah, I think we can run the jobs locally with |
That's a good idea. But maybe better with Prometheus and metrics-server, to hopefully get the highest value (instead of manually grabbing it from k top). |
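As a sketch of that idea, a hypothetical Prometheus recording rule could capture the peak working-set memory of the job pods, so requests/limits can be sized from observed usage; the rule name and the namespace label are assumptions:

```yaml
# Hypothetical recording rule: track peak per-container memory of job pods.
# container_memory_working_set_bytes is the standard cAdvisor metric; the
# "test-pods" namespace is an assumption about where ProwJob pods run.
groups:
- name: prowjob-memory
  rules:
  - record: prowjob:container_memory_working_set_bytes:max1h
    expr: max_over_time(container_memory_working_set_bytes{namespace="test-pods", container!=""}[1h])
```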
Yes. CPU/memory reqs and limits are required to run on community-owned (GKE/EKS) clusters. |
@fabriziopandini this was the issue for migrating prowjobs mentioned in the office hours yesterday. |
@lentzi90 thanks, reporting here some notes from the office hours discussion: a call for action about moving CI jobs to EKS
|
Hey all, with the release of 1.5.0 on the books, would it be a good time to start moving some of the CAPI jobs over to EKS? Should we start with the 1.3 jobs and see how it goes? |
+1 |
+1 /cc @nawazkh |
/unassign |
/assign rjsadow Because you already opened and merged the first PR 🎉 Thanks for taking this over! |
Note: a dashboard that helps to fine-tune the memory/CPU requests: https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&refresh=30s&var-org=kubernetes-sigs&var-repo=cluster-api&var-job=periodic-cluster-api-e2e-workload-upgrade-1-18-1-19-release-1-3&var-build=All |
Just an update: merged a bunch of PRs. Now all 1.3 jobs should run on the EKS cluster. Please verify, though. |
But it looks like the test-infra PRs did not all link to this issue. Can we please correct that? It's very hard to track what we did otherwise. Maybe someone could post a quick summary here as well. |
Summary:
|
For completeness on ^, you missed kubernetes/test-infra#30340 after
;) |
xref: #8426 (comment)
|
As Killian wrote in kubernetes/test-infra#30365 (review), let's hold off on further migrations until release-1.3 is fixed and then stable for a bit. |
It seems like the 1.3 jobs have stabilized relatively well since the public IP changes. How does everyone feel about pushing forward with the 1.4 migration in kubernetes/test-infra#30365? |
It would be good to get these flakes fixed: #9379 |
Looks like the last occurrence of |
@furkatgofurov7 Can we already open a PR to move the 1.5 jobs? Just so we can already start reviewing it to get it ready. |
As discussed offline, CI team members can handle this (@nawazkh @adilGhaffarDev @kranurag7 Sunnat); otherwise I am happy to prepare it. Please let me know. |
@kranurag7 is making the PR. |
@ameukam kubernetes/test-infra#31386 is getting merged now. Do you have an easy way to check if we migrated all jobs (just to make sure we didn't miss any)? |
We could check https://prow.k8s.io/?repo=kubernetes-sigs%2Fcluster-api&cluster=default to see if any more jobs are running on the default cluster after that PR is merged. |
We had a list being tracked in #9609 (comment). I just validated quickly, and I think we migrated all the jobs on the list. I'll do a second round of validation using the link shared by Jakob above. |
Thx! Yup, I was mostly asking Arnaud because I think he has a tool/script to generate these lists. |
We have a tool in
Specific Jobs
|
Perfect. Then I would close this issue for core CAPI. /close |
@sbueringer: Closing this issue. In response to this: |
What would you like to be added (User Story)?
This issue is for discussing and tracking efforts to migrate existing prowjobs over to the new infrastructure provided by test-infra.
Detailed Description
One point at the office hours on 10th May 2023 was:
Migrating jobs over from the Google infrastructure might also help enable folks at test-infra to look into issues (e.g. to help with debugging), because all test-infra folks are able to take a look at the non-default target Prow cluster.
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
/area ci