Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-bazel-build #19073

Closed
spiffxp opened this issue Aug 31, 2020 · 10 comments
Labels: area/jobs, kind/cleanup, sig/testing

spiffxp commented Aug 31, 2020

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of [email protected]. If you're not a member, open a PR adding yourself to https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628.

Migrate pull-kubernetes-bazel-build to k8s-infra-prow-build by adding a `cluster: k8s-infra-prow-build` field to the job.

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because other flags need to be removed to move it off of RBE (remote build execution):

  - --config=remote
  - --remote_instance_name=projects/k8s-prow-builds/instances/default_instance
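The change above can be sketched as a small Python helper that operates on the job definition after it has been parsed from YAML into a dict. This is a hypothetical illustration, not tooling from test-infra: the function name `migrate_job`, the assumption that the flags live under `spec.containers[*].args`, and the placeholder argument are all assumptions. In practice the migration is a hand edit to the job's YAML config.

```python
import copy
from typing import Any, Dict

# The two RBE-specific flags this issue says must be removed.
RBE_FLAGS = {
    "--config=remote",
    "--remote_instance_name=projects/k8s-prow-builds/instances/default_instance",
}


def migrate_job(job: Dict[str, Any], cluster: str = "k8s-infra-prow-build") -> Dict[str, Any]:
    """Return a copy of `job` pinned to `cluster`, with the RBE flags dropped.

    Hypothetical helper: assumes the flags live in spec.containers[*].args,
    the usual layout of a Prow job's pod spec.
    """
    migrated = copy.deepcopy(job)  # never mutate the caller's job definition
    migrated["cluster"] = cluster
    for container in migrated.get("spec", {}).get("containers", []):
        # Only touch containers that actually declare args.
        if "args" in container:
            container["args"] = [a for a in container["args"] if a not in RBE_FLAGS]
    return migrated


if __name__ == "__main__":
    job = {
        "name": "pull-kubernetes-bazel-build",
        "spec": {
            "containers": [
                {
                    "args": [
                        "--placeholder-arg",  # hypothetical; stands in for the job's other args
                        "--config=remote",
                        "--remote_instance_name=projects/k8s-prow-builds/instances/default_instance",
                    ]
                }
            ]
        },
    }
    migrated = migrate_job(job)
    print(migrated["cluster"])                        # k8s-infra-prow-build
    print(migrated["spec"]["containers"][0]["args"])  # ['--placeholder-arg']
```

Returning a deep copy keeps the original definition intact, which makes it easy to diff the before/after versions of the job.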

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.
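One way to do the before/after comparison is to split the job's runs at the merge time and compare the failure rate and median duration on each side. The sketch below assumes run records have already been exported into a hypothetical `Run` tuple; the field names are not a real Prow or Gubernator schema, and the sample data is made up for illustration.

```python
import statistics
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Optional, Tuple


class Run(NamedTuple):
    """One run of the job. Hypothetical shape, not a real Prow/Gubernator schema."""
    started: datetime
    duration_min: float
    passed: bool


def summarize(runs: List[Run]) -> Optional[Tuple[float, float]]:
    """Return (failure rate, median duration in minutes), or None if there are no runs."""
    if not runs:
        return None
    failures = sum(1 for r in runs if not r.passed)
    return failures / len(runs), statistics.median(r.duration_min for r in runs)


def compare(runs: List[Run], merged_at: datetime) -> dict:
    """Split runs at the merge time and summarize each side."""
    before = [r for r in runs if r.started < merged_at]
    after = [r for r in runs if r.started >= merged_at]
    return {"before": summarize(before), "after": summarize(after)}


if __name__ == "__main__":
    # 2020-11-23 10:29 PT (PST, UTC-8) expressed in UTC.
    merged = datetime(2020, 11, 23, 18, 29, tzinfo=timezone.utc)
    # Made-up sample runs, for illustration only.
    sample = [
        Run(merged - timedelta(hours=2), 10.0, True),
        Run(merged - timedelta(hours=1), 12.0, False),
        Run(merged + timedelta(hours=1), 43.0, True),
        Run(merged + timedelta(hours=2), 41.0, True),
        Run(merged + timedelta(hours=3), 45.0, True),
    ]
    print(compare(sample, merged))  # {'before': (0.5, 11.0), 'after': (0.0, 43.0)}
```

The median is used here as a simple robust summary; this thread itself later looks at daily p99 duration, which needs more runs per day to be meaningful.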

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs

spiffxp added the kind/cleanup label Aug 31, 2020
k8s-ci-robot added the wg/k8s-infra, sig/testing, and area/jobs labels Aug 31, 2020
spiffxp commented Aug 31, 2020

I suggest a migration strategy similar to #19070 (comment): first try a canary job.

Check whether a CI variant of this job has already been migrated or canaried, and if so, whether that had any impact.

cpanato commented Nov 5, 2020

/assign

rayandas commented Nov 5, 2020

/assign

ameukam commented Nov 5, 2020

/assign

cpanato commented Nov 5, 2020

/unassign

rayandas commented Nov 5, 2020

/unassign

spiffxp commented Nov 23, 2020

#19855 merged 2020-11-23 10:29 PT

Holding this open to verify that the job continues to pass and its behavior hasn't regressed.

spiffxp commented Nov 23, 2020

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-bazel-build&grid=old&graph-metrics=test-duration-minutes

The bump in duration from ~10 min to ~43 min is not too surprising; we saw a similarly sized jump when migrating pull-kubernetes-bazel-test (ref: #19070 (comment)).

There isn't enough traffic yet to tell whether flakiness has increased, as opposed to failures from PRs with genuinely broken builds.

spiffxp commented Jan 8, 2021

There's been plenty of traffic by this point. Flakiness (or at least the failure rate) has not increased.

Screenshot from a local Grafana instance pointed at k8s-gubernator:builds: the jump in daily p99 duration marks when we switched, and the failure rate has trended down since then.
[Screenshot: Grafana graphs, 2021-01-08 11:15:02 AM]

I also checked the Metrics Explorer graphs for resource usage; the job's resources are neither over- nor under-provisioned.

Thanks @ameukam!

/close

k8s-ci-robot commented

@spiffxp: Closing this issue.
