
Reenable kubeadm presubmit test. #2976

Closed
wants to merge 1 commit

Conversation

pipejakob
Contributor

This had previously been disabled by #2568. This adds the job back to the bazel pipeline and reenables it.

Merging this is blocked by kubernetes/kubernetes#46864, which will fix kubeadm join, but I wanted to get the PR out early to get feedback and make sure I clear all presubmit checks.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2017
@pipejakob pipejakob force-pushed the reenable-kubeadm-pull branch from 9dd42a6 to 77e23e4 on June 6, 2017 09:21
@spiffxp
Member

spiffxp commented Jun 7, 2017

/lgtm
/cc @fejta @krzyzacy

@k8s-ci-robot k8s-ci-robot requested review from fejta and krzyzacy June 7, 2017 22:43
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2017
@luxas
Member

luxas commented Jun 8, 2017

@pipejakob when the kubeadm job is green again (that is, after kubernetes/kubernetes#46879), we can merge this.

@luxas
Member

luxas commented Jun 8, 2017

/assign @mikedanese @roberthbailey

@roberthbailey
Contributor

lgtm once we are sure the test is working.

The latest run I see on testgrid is failing due to not having enough quota to even stand up VMs on which to run kubeadm:

W0609 02:00:23.099] Error applying plan:
W0609 02:00:23.099] 
W0609 02:00:23.099] 1 error(s) occurred:
W0609 02:00:23.099] 
W0609 02:00:23.100] * google_compute_instance.e2e-3791-master: Error creating instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 24.0, quotaExceeded
W0609 02:00:23.100] 
W0609 02:00:23.100] Terraform does not automatically rollback in the face of errors.
W0609 02:00:23.101] Instead, your Terraform state file has been partially updated with
W0609 02:00:23.101] any resources that successfully completed. Please address the error
W0609 02:00:23.101] above and apply again to incrementally change your infrastructure.
W0609 02:00:23.101] make[1]: *** [do] Error 1
W0609 02:00:23.102] make: *** [deploy-cluster] Error 2

This had been previously disabled by
kubernetes#2568. Adding the job back
to the bazel pipeline and reenabling.

Also, remove sporadic trailing whitespace.
@pipejakob pipejakob force-pushed the reenable-kubeadm-pull branch from 77e23e4 to 5e0a9cf on June 17, 2017 02:55
@pipejakob
Contributor Author

Rebased, and increased the quota for this project to 60 cores. I'm not sure how bursty presubmit runs will be, but prow makes no attempt to serialize runs of the same job; it just spawns jobs as quickly as PRs need them.
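For reference, a minimal way to verify the project's regional CPU quota and current usage (a sketch only; the project ID and region below are placeholders, not values taken from this PR's config):

# Sketch: inspect the regional quota for the project the e2e clusters are created in.
# "my-prow-build-project" and "us-central1" are placeholder values.
gcloud compute regions describe us-central1 --project=my-prow-build-project
# The output includes a quotas list with the CPUS metric, its limit, and current usage.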

@luxas
Member

luxas commented Jun 17, 2017

/assign @fejta @krzyzacy
PTAL

@krzyzacy
Member

@pipejakob ready to merge?

@roberthbailey
Contributor

The tests against master are now running green.

Member

@luxas left a comment


Can we make it run automatically on cmd/kubeadm changes?

Or will it already since bazel is always running and this job runs after the bazel one?

@luxas
Member

luxas commented Jun 26, 2017

@pipejakob This job currently fails with:

W0626 06:58:51.035] 2017/06/26 06:58:51 util.go:129: Running: gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json
W0626 06:58:51.670] Activated service account credentials for: [[email protected]]
W0626 06:58:51.696] 2017/06/26 06:58:51 util.go:131: Step 'gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json' finished in 660.196024ms
W0626 06:58:51.745] 2017/06/26 06:58:51 main.go:161: Saved XML output to /workspace/k8s.io/kubernetes/_artifacts/junit_runner.xml.
W0626 06:58:51.746] 2017/06/26 06:58:51 util.go:198: Running: bash -c . hack/lib/version.sh && KUBE_ROOT=. kube::version::get_version_vars && echo "${KUBE_GIT_VERSION-}"
W0626 06:58:52.251] 2017/06/26 06:58:52 util.go:200: Step 'bash -c . hack/lib/version.sh && KUBE_ROOT=. kube::version::get_version_vars && echo "${KUBE_GIT_VERSION-}"' finished in 504.717402ms
W0626 06:58:52.251] 2017/06/26 06:58:52 main.go:195: Something went wrong: failed to acquire k8s binaries: open /go/src/k8s.io/kubernetes/_output/gcs-stage: no such file or directory
W0626 06:58:52.252] +(/workspace/e2e-runner.sh:1): main(): chmod -R o+r /workspace/k8s.io/kubernetes/_artifacts
W0626 06:58:52.255] Traceback (most recent call last):
W0626 06:58:52.256]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 535, in <module>
W0626 06:58:52.256]     main(parse_args())
W0626 06:58:52.257]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 449, in main
W0626 06:58:52.257]     mode.start(runner_args)
W0626 06:58:52.257]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 219, in start
W0626 06:58:52.258]     check_env(env, self.runner, *args)
W0626 06:58:52.258]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 56, in check_env
W0626 06:58:52.258]     subprocess.check_call(cmd, env=env)
W0626 06:58:52.258]   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
W0626 06:58:52.277]     raise CalledProcessError(retcode, cmd)
W0626 06:58:52.278] subprocess.CalledProcessError: Command '('/workspace/e2e-runner.sh', '--up', '--down', '--extract=local', '--kubernetes-anywhere-kubernetes-version=latest', '--deployment=kubernetes-anywhere', '--timeout=55m', '--check-leaked-resources=false', '--kubernetes-anywhere-path=/workspace/kubernetes-anywhere', '--kubernetes-anywhere-phase2-provider=kubeadm', '--kubernetes-anywhere-cluster=e2e-5296', '--kubernetes-anywhere-kubeadm-version=gs://kubernetes-release-dev/bazel/48042/master:53a66020e4bf54d66aab5b9f625af7d10ed4c3f5,48042:c257eb358f6cebe35aa871a817c04c17147d98bc/bin/linux/amd64/')' returned non-zero exit status 1
E0626 06:58:52.282] Build failed
I0626 06:58:52.282] process 471 exited with code 1 after 0.0m
E0626 06:58:52.283] FAIL: pull-kubernetes-e2e-kubeadm-gce
I0626 06:58:52.283] Upload result and artifacts...
I0626 06:58:52.284] Gubernator results at https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/5296
I0626 06:58:52.285] Call:  gsutil -m -q -o GSUtil:use_magicfile=True cp -r -c -z log,txt,xml _artifacts gs://kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/5296/artifacts
I0626 06:58:53.877] process 544 exited with code 0 after 0.0m
I0626 06:58:53.878] Call:  git rev-parse HEAD
I0626 06:58:53.885] process 888 exited with code 0 after 0.0m
... skipping 2 lines ...
I0626 06:58:54.785] Call:  gsutil -q cat 'gs://kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-kubeadm-gce/jobResultsCache.json#1497892018921423'
I0626 06:58:55.972] process 1027 exited with code 0 after 0.0m
I0626 06:58:55.980] Call:  gsutil -q -h Content-Type:application/json -h x-goog-if-generation-match:1497892018921423 cp - gs://kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-kubeadm-gce/jobResultsCache.json
I0626 06:58:57.736] process 1167 exited with code 0 after 0.0m
I0626 06:58:57.738] Call:  gsutil stat gs://kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/jobResultsCache.json
W0626 06:58:58.679] No URLs matched: gs://kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/jobResultsCache.json
E0626 06:58:58.679] Build failed
I0626 06:58:58.679] process 1337 exited with code 1 after 0.0m
I0626 06:58:58.680] Call:  gsutil -q -h Content-Type:application/json -h x-goog-if-generation-match:0 cp - gs://kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/jobResultsCache.json
I0626 06:59:00.138] process 1475 exited with code 0 after 0.0m
I0626 06:59:00.139] Call:  gsutil -q -h Content-Type:application/json cp - gs://kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/5296/finished.json
I0626 06:59:01.516] process 1645 exited with code 0 after 0.0m
I0626 06:59:01.517] Call:  gsutil -q -h Content-Type:text/plain -h 'Cache-Control:private, max-age=0, no-transform' cp - gs://kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-kubeadm-gce/latest-build.txt

ref: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/48042/pull-kubernetes-e2e-kubeadm-gce/5296/?log#log

@timothysc
Member

I'll bring this up in sig-testing today. /cc @fejta

@pipejakob
Contributor Author

As part of the (completely understandable) security lockdown of our prow cluster, I've lost my direct access recently. I just pinged @fejta + @spxtr out of band to request temporary access to debug and fix up these recent failures.

@ixdy
Member

ixdy commented Jun 27, 2017

when last I looked at this job, I think I concluded that --extract=local is being misused here (with kubetest), but I don't know what would be more appropriate.

@timothysc
Member

What is the state of this? I think this should be an imperative for 1.8.

@luxas
Member

luxas commented Jul 10, 2017 via email

@pipejakob
Contributor Author

I've been diving into debugging this and found a few issues that I need to fix (beyond just rebasing). It might take a little longer than expected.

@spiffxp
Member

spiffxp commented Jul 26, 2017

/unassign

@luxas
Member

luxas commented Jul 26, 2017

ping @pipejakob Are you able to look at this anytime soon?
Otherwise we should assign someone else to take a shot at it; it will be crucial to have this soon.

cc @timothysc @roberthbailey

@pipejakob
Contributor Author

A quick update on this: the presubmit job was using --extract local, which means it should use the local artifacts from the current build, which we don't produce. I'm not sure how a build was being triggered before (since I wasn't passing --build to kubetest), but the e2e tests were passing then, and now the binaries can't be found, which makes sense.

One easy option is to use --extract ci/latest to just grab the latest e2e.test from another CI build, but that has a lot of downsides. It wouldn't actually exercise any e2e test changes in the current PR, which means someone could still very easily merge a breaking change, and the PR to fix it would still be failing the e2e tests.

This job is chained off of the existing bazel presubmit test, so another option (my preference) is to add a new kubetest extractStrategy to be able to pull the bazel build that was already run for the candidate PR and reuse those binaries.

We could also repeat the build in the kubeadm e2e, but that's also problematic: our make release and make quick-release builds use the Docker build image, but this job already runs within a container, and EngProd generally advises against using Docker-in-Docker. The bazel build doesn't require launching a new container, but then we have to make sure that the bazel version and environment stay well in sync with the existing bazel build image so that the build matches.

I'm open to other ideas, but I think the best option is to add support for the new kubetest extraction strategy to reuse the binaries built during the bazel presubmit job. I'll start coding that up.
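As a rough illustration of that preference (not the eventual kubetest implementation), the bazel presubmit already pushes its binaries to a PR-specific GCS path (visible in the job log above), so the kubeadm job could in principle copy those instead of rebuilding. The exact path layout and environment variables below are assumptions for the sketch:

# Sketch only: reuse the binaries the bazel presubmit already pushed for this PR.
# PULL_NUMBER and PULL_REFS are the variables prow provides; the path layout is
# inferred from the job log above and may not match exactly.
BAZEL_BUILD="gs://kubernetes-release-dev/bazel/${PULL_NUMBER}/${PULL_REFS}"
mkdir -p _output/bin
gsutil -m cp "${BAZEL_BUILD}/bin/linux/amd64/*" _output/bin/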

@luxas
Member

luxas commented Jul 31, 2017

I'm open to other ideas, but I think the best option is to add support for the new kubetest extraction strategy to reuse the binaries built during the bazel presubmit job. I'll start coding that up.

@pipejakob SGTM

@fejta
Contributor

fejta commented Jul 31, 2017

I would like to see the build job do kubetest --stage=gs://something-specific-to-the-pr and then all the chained e2e jobs do kubetest --extract=gs://something-specific-to-the-pr
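A hedged sketch of that flow (--build, --stage, and --extract are existing kubetest flags; the bucket path is a placeholder for whatever PR-specific location the build job would choose):

# In the bazel build presubmit: build and stage binaries to a PR-specific path.
kubetest --build=bazel --stage=gs://some-pr-staging-bucket/${PULL_NUMBER}

# In each chained e2e presubmit (including the kubeadm one): extract the same
# staged binaries instead of using --extract=local.
kubetest --extract=gs://some-pr-staging-bucket/${PULL_NUMBER} \
  --deployment=kubernetes-anywhere --up --down --timeout=55m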

@pipejakob
Contributor Author

@fejta Thanks for the suggestion. I think that'll turn out simpler than what I was doing. I'll give it a shot.

@fejta
Contributor

fejta commented Aug 7, 2017

FYI, please sync up with @BenTheElder, who is also interested in refactoring the e2e jobs to use this --extract approach.

@fejta
Contributor

fejta commented Aug 7, 2017

/cc @BenTheElder

@spxtr
Contributor

spxtr commented Aug 23, 2017

/test all

@k8s-ci-robot
Contributor

@pipejakob: The following tests failed, say /retest to rerun them all:

Test name	Commit	Rerun command
pull-test-infra-verify-bazel	5e0a9cf	/test pull-test-infra-verify-bazel
pull-test-infra-bazel	5e0a9cf	/test pull-test-infra-bazel
pull-test-infra-verify-gofmt	5e0a9cf	/test pull-test-infra-verify-gofmt
pull-test-infra-verify-govet	5e0a9cf	/test pull-test-infra-verify-govet

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@luxas
Member

luxas commented Aug 23, 2017

@pipejakob have you had a chance to look at this yet?

@pipejakob
Contributor Author

My related PRs to support using the correct e2e.test binary have now been merged, but this is going to be blocked on kubernetes/kubernetes#50760, since we currently can't get a green run of the kubeadm jobs (despite all e2e tests passing), so I don't want to add an already-failing blocking presubmit.

Since this PR is particularly painful to rebase and keeps hitting conflicts with changes to config.yaml, I'd like to wait until we get kubernetes/kubernetes#50760 sorted out before moving forward. I can close this for now to get it out of people's review queues and reopen it when it's actually ready to be reviewed again.

@pipejakob pipejakob closed this Aug 23, 2017
@luxas
Member

luxas commented Aug 23, 2017

Okay, thanks @pipejakob
