
kubeadm e2e testing #190

Closed · 5 of 13 tasks
pipejakob opened this issue Mar 2, 2017 · 36 comments

pipejakob (Contributor) commented Mar 2, 2017

This issue is to break out and track work for the opaque "e2e testing" subtask of #63.

This is a work-in-progress, and still needs milestones defined for what we need for kubeadm beta vs. GA.

  • Create post-submit job to test up/down kubeadm clusters (owner: @pipejakob)
    • Extend post-submit job to exercise e2e Conformance tests of kubeadm clusters
  • Create pre-submit job to exercise e2e Conformance tests of kubeadm clusters (owner: @pipejakob)
    • Run pre-submit job automatically on kubeadm changes (owner: @pipejakob)
    • Graduate pre-submit job to block pull requests on failure
  • Create generic e2e tests for the BootstrapSigner and TokenCleaner controllers (owner: @jbeda)
    • This is a requirement for moving these controllers to beta
  • Create generic e2e tests for the BootstrapTokenAuthenticator authentication module (owner: ?)
    • This is a requirement for moving this apiserver module to beta
  • Add kubeadm-specific e2e tests for specific functionality: (owner: @dmmcquay)
    • Cluster creation using tokens
    • Cluster creation using file discovery
    • Cluster creation using HTTPS discovery
    • Token creation and deletion after cluster initialization
    • Verify that the kubeadm phase commands do exactly what they should and work well together

CC @kubernetes/sig-cluster-lifecycle-misc

dmmcquay commented Mar 2, 2017

@pipejakob I'll take ownership of kubeadm-specific e2e tests

pipejakob (Contributor Author)

@dmmcquay Awesome. I knew you wanted to own that area; I just wasn't sure what the full test plan was (the sublist under that item). Feel free to update it with the scenarios you hope to exercise.

luxas (Member) commented Mar 2, 2017

@jbeda should also craft some e2e's for the BootstrapSigner and TokenCleaner

pipejakob (Contributor Author) commented Mar 2, 2017

I've been tackling the first one of these (e2e Conformance tests) by adding kubeadm support to kubernetes-anywhere, and support for kubernetes-anywhere as a deployment option to hack/e2e.go (which is now being migrated to test-infra/kubetest). You can bring up, test, and tear down a GCE kubeadm cluster via something like:

$ cd ~/go/src/k8s.io/kubernetes
$ export PROJECT=my-gcp-project
$ export GOOGLE_APPLICATION_CREDENTIALS=<path_to_credentials_file>
$ go run hack/e2e.go -- -v --deployment kubernetes-anywhere --kubernetes-anywhere-path <path_to_kubernetes-anywhere> --kubernetes-anywhere-phase2-provider kubeadm --kubernetes-anywhere-cluster my-e2e-test --up --test --down

You can also specify --kubernetes-anywhere-kubeadm-version <gs://link-to-build> to use a custom kubeadm build instead of the latest published stable debs (under review: kubernetes/test-infra#2094).
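
Spelled out, an invocation with a custom build might look like the following (a sketch only; the gs:// path and the kubernetes-anywhere path are placeholders for your own bucket and checkout, not real values):

$ go run hack/e2e.go -- -v \
    --deployment kubernetes-anywhere \
    --kubernetes-anywhere-path <path_to_kubernetes-anywhere> \
    --kubernetes-anywhere-phase2-provider kubeadm \
    --kubernetes-anywhere-cluster my-e2e-test \
    --kubernetes-anywhere-kubeadm-version gs://<bucket>/<build>/ \
    --up --test --down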

This is all exercised in the test-infra CI job that gets run by prow here. It uses a custom Docker image and runner.

I'll translate this into a more comprehensive document in the repo so it's easier to discover and get started testing. I just wanted to add this comment so anyone else would have the pointers necessary to get started and poke around until a more complete document is written.

pipejakob (Contributor Author)

@luxas or anyone else: feel free to add directly to the checklist in the description for scenarios you think we should test for regressions. After a round of brainstorming, I'll drive trying to get SIG consensus around what deliverables belong to what milestones and clean up the list.

luxas (Member) commented Mar 2, 2017

@pipejakob Thanks! I added some bullets to the list.

A document describing this flow in this repo will be awesome.
I also hope we can start running the e2e Conformance tests as soon as possible (preferably in time for v1.6, so we can show folks we have something that detects regressions); feel free to ping me if you need help setting it up.

pipejakob (Contributor Author)

A few days ago, I sent out a bundle of PRs to get the CI kubeadm e2e test back to green (kubernetes/test-infra#2179 kubernetes/test-infra#2180 kubernetes-retired/kubernetes-anywhere#352 kubernetes-retired/kubernetes-anywhere#353 kubernetes/test-infra#2182 kubernetes/test-infra#2184 kubernetes/test-infra#2183), and most have been merged now. My testing also found a kubeadm bug I was able to fix.

Now, the latest issue I'm trying to work through is getting a pod network working on a 1.6 cluster. Weave Net fails because insecure localhost:8080 access has been removed. I had problems getting Calico or Canal working even after applying the kubeadm.alpha.kubernetes.io/role=master label to the master node, although I haven't had time to dig into them too deeply. I had the most luck with Flannel plus its RBAC bindings, but then the kube-dns pod would crash and restart pretty routinely. My temporary solution is to disable the e2e tests in our CI e2e job. This isn't as counter-productive as it sounds, since the job still exercises a full cluster-up and makes sure all nodes join correctly before tearing the cluster down, so it can still catch regressions while a fix to re-enable Conformance testing is developed.

@luxas @dmmcquay If either of you have bandwidth to figure out a reliable pod network to install on 1.6 clusters (or know of one already), that would be very helpful. I'll continue to debug myself in the meantime.

luxas (Member) commented Mar 10, 2017

@pipejakob Weave doesn't work due to the RBAC enablement, not the 8080 thing directly.
See weaveworks/weave#2801 for a manifest that works 👍

pipejakob (Contributor Author)

Ah. I had just tried the old instructions of kubectl apply -f https://git.io/weave-kube (from here) and noticed a lot of failures in its logs about not being able to connect to localhost:8080. I didn't know about the more recent upstream development. Are you saying that weaveworks/weave#2801 still doesn't work with RBAC, or that it should be fully good to go for 1.6 clusters? In either case, I'll jump on testing it out. Thanks!

luxas (Member) commented Mar 10, 2017

@pipejakob It does work with v1.6

pipejakob (Contributor Author)

@luxas So, the weave-net pods seem to come up fine, but kube-dns remains stuck in ContainerCreating afterward (I recreated a fresh cluster a few times to make sure it wasn't just a race condition). This is using a CI build of the kubeadm debs, --kubernetes-version latest, and the weave manifest from the PR you linked. I don't know which piece is at fault yet, and I'll keep digging today.

pipejakob (Contributor Author)

I've gotten the kubeadm e2e CI job back to green:

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-gce/480/

The actual Conformance tests are still disabled due to the kube-dns problems I was having, which should be fixed in beta.3. I'll verify a successful e2e run and re-enable them, then create a new prow entry to run the job on pull requests as well.

pipejakob (Contributor Author)

No luck with beta.3.

Let me elaborate on my setup and the problem I'm seeing, because I think the other issue @luxas brought up might have been a red herring. I'm using 5 GCE hosts: 1 master and 4 nodes. On the master:

$ kubeadm init --token <omitted> --apiserver-bind-port 443 --apiserver-advertise-address 35.184.182.170 --kubernetes-version latest
$ KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/luxaslabs/weave/d18e9cf56f69bf01c61178df47806488e96793c8/prog/weave-kube/weave-daemonset-k8s-HEAD.yaml

Then, I have the nodes join, each with:

$ kubeadm join --token <omitted> 10.128.0.2:443

At this point, everything looks successful, except that kube-dns is spinning with status ContainerCreating:

$ kubectl get pods -n kube-system
NAME                                             READY     STATUS              RESTARTS   AGE
etcd-e2e-kubeadm-gce-master                      1/1       Running             0          2m
kube-apiserver-e2e-kubeadm-gce-master            1/1       Running             0          2m
kube-controller-manager-e2e-kubeadm-gce-master   1/1       Running             0          2m
kube-dns-3913472980-cnpct                        0/3       ContainerCreating   0          3m
kube-proxy-5061q                                 1/1       Running             0          3m
kube-proxy-h3tvn                                 1/1       Running             0          3m
kube-proxy-pq0xf                                 1/1       Running             0          3m
kube-proxy-r5mf9                                 1/1       Running             0          3m
kube-proxy-xg7q1                                 1/1       Running             0          3m
kube-scheduler-e2e-kubeadm-gce-master            1/1       Running             0          2m
weave-net-5htf0                                  2/2       Running             0          1m
weave-net-c7vm6                                  2/2       Running             0          1m
weave-net-jctdk                                  2/2       Running             0          1m
weave-net-mqnt7                                  2/2       Running             0          1m

If I describe the pod, I see:

$ kubectl describe pod kube-dns-3913472980-cnpct -n kube-system
...
Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----				-------------	--------	------		-------
  4m		4m		1	default-scheduler				Normal		Scheduled	Successfully assigned kube-dns-3913472980-cnpct to e2e-kubeadm-gce-master
  4m		4m		1	kubelet, e2e-kubeadm-gce-master			Warning		FailedSync	Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3913472980-cnpct_kube-system(f2532e76-09ae-11e7-8110-42010a800002)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3913472980-cnpct_kube-system(f2532e76-09ae-11e7-8110-42010a800002)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3913472980-cnpct_kube-system\" network: cni config uninitialized"

  4m	4s	19	kubelet, e2e-kubeadm-gce-master		Normal	SandboxChanged	Pod sandbox changed, it will be killed and re-created.
  4m	4s	19	kubelet, e2e-kubeadm-gce-master		Warning	FailedSync	Error syncing pod, skipping: failed to "KillPodSandbox" for "f2532e76-09ae-11e7-8110-42010a800002" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-3913472980-cnpct_kube-system\" network: cni config uninitialized"

The odd things that stick out to me are that weave-net pods run on every node (but not the master), and have successfully created /etc/cni/net.d/10-weave.conf on those nodes. The kube-dns pod, however, is getting scheduled on the master, which has no /etc/cni/net.d/ directory at all, so it seems to be rightly complaining about CNI not being initialized.

I'm still new to the CNI architecture, and am not sure what the expectations are between the network provider and kubelet, so I'm not sure if it's wrong that kube-dns is trying to run on the master instead of one of the nodes, or wrong that there is no weave-net pod that runs on the master to create the /etc/cni/net.d configuration, or some other mechanism entirely that's supposed to configure CNI on the master.
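
For anyone reproducing this, a quick sketch of the checks that show the mismatch described above (nothing kubeadm-specific, just plain kubectl plus a shell on the machines):

$ kubectl get pods -n kube-system -o wide   # confirms kube-dns was scheduled on the master
# on the master:
$ ls /etc/cni/net.d/                        # directory missing, hence "cni config uninitialized"
# on any node running a weave-net pod:
$ ls /etc/cni/net.d/                        # contains 10-weave.conf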

yujuhong

@pipejakob @luxas there is no CNI configuration on the node, so the kubelet failing to set up / tear down the network seems OK to me.

It sounds like the weave-net pods should also run on the master to populate those files.

/cc @kubernetes/sig-network-misc
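
(If the cause turns out to be that the DaemonSet doesn't tolerate the master's taint, a hypothetical way to test that theory would be to add a toleration by hand; the taint key node-role.kubernetes.io/master and the weave-net DaemonSet name are assumptions here, not something confirmed in this thread:)

$ kubectl -n kube-system get ds weave-net -o yaml | grep -A4 tolerations
$ kubectl -n kube-system patch ds weave-net --type json \
    -p '[{"op":"add","path":"/spec/template/spec/tolerations","value":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}]'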

pipejakob (Contributor Author)

@luxas Since your setup gets you further than mine, can you confirm that either you have CNI configured on the master, or kube-dns running on a non-master node? I wanted to confirm in which direction I should try to take the fix.

luxas (Member) commented Mar 15, 2017

@pipejakob Yes, I got past that issue and will have a fix for it soon, but it seems we're facing a third issue: weaveworks/weave#2850 (a CNI breaking change; at least Weave is affected)

Will ping you here as soon as I've updated my manifest so you can try again

pipejakob (Contributor Author)

@luxas Thanks!

jbeda commented Mar 15, 2017

Wow -- CNI breaking change? That is going to be a huge problem.

dcbw (Member) commented Mar 15, 2017

@jbeda you can continue using the older CNI code in the plugin if you want to; Kubernetes will still work with that version. But by vendoring in the new CNI code, you opt into the new features the 0.5.0 release provides.

it's not a "huge problem"; you either opt in or you don't. But if you opt in, obviously some things change.

dcbw (Member) commented Mar 15, 2017

Also, even though Kubernetes does not yet understand the new CNI spec return format, the weave plugin should be handling that OK by returning the result in the format that Kubernetes expects (CNI spec version 0.2.0) as long as the CNI network configuration JSON does not set a "cniVersion" greater than 0.2.0. Is that the case? If the config JSON sets "cniVersion" 0.3.0 or higher then that's a misconfiguration and yes, kube will fail to interpret the result.
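
For concreteness, a drop-in that stays on the old behavior pins the version explicitly; a minimal, made-up example using the upstream bridge/host-local plugins (file name, network name, and subnet are illustrative only):

$ cat /etc/cni/net.d/10-example.conf
{
  "cniVersion": "0.2.0",
  "name": "examplenet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.0.0/16",
    "routes": [ { "dst": "0.0.0.0/0" } ]
  }
}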

caseydavenport (Member)

I'm a bit out of touch with this discussion, so maybe this has already been discussed / solved, but the various discussions I've seen here and in #sig-network have led me to wonder what we can do to decouple this e2e testing from third-party CNI manifests, which may or may not be stable.

I think we should be considering what we can do to use upstream CNI plugins (e.g. bridge) for the e2es.

This certainly becomes easier when kubenet is itself a CNI plugin like any other.

jbeda commented Mar 15, 2017

Neither weave nor calico sets the cniVersion member in its drop-ins. I suspect this is common, and the fact that nothing errored on it in the past is what led to this behavior.

If this is really the problem, a sane thing to do now would be to have "no version" mean 0.2 with a warning. If users want 0.3 they should have to set it explicitly. At some point in the future deprecate 0.2 and make sure everyone is updated.

Note that calico isn't complaining about the env variable but the kubelet still has errors like this:

Mar 15 23:38:48 kubeadm-master kubelet[5449]: E0315 23:38:48.303056    5449 docker_sandbox.go:176] Failed to stop sandbox "fa74893340ba1e54ccbd595f752fda45085a6c020790ae04314e084d4e1bf822": Error response from daemon: {"message":"No such contai
Mar 15 23:38:48 kubeadm-master kubelet[5449]: E0315 23:38:48.303119    5449 remote_runtime.go:109] StopPodSandbox "fa74893340ba1e54ccbd595f752fda45085a6c020790ae04314e084d4e1bf822" from runtime service failed: rpc error: code = 2 desc = Network
Mar 15 23:38:48 kubeadm-master kubelet[5449]: E0315 23:38:48.303129    5449 kuberuntime_gc.go:138] Failed to stop sandbox "fa74893340ba1e54ccbd595f752fda45085a6c020790ae04314e084d4e1bf822" before removing: rpc error: code = 2 desc = NetworkPlug

pipejakob (Contributor Author)

Sent out a few more PRs to run kubeadm e2es against PRs and fix a race condition in them that's causing flakes.

pipejakob (Contributor Author)

I confirmed that the newest weave-net manifest fixes all of the issues I had been seeing. I'm running the Conformance tests locally now, and will report any findings. If they're green, I'll re-enable them in the CI/pull jobs as well.

pipejakob (Contributor Author)

I just had kubernetes-retired/kubernetes-anywhere#363 merged to add support for weave-net to kubernetes-anywhere (which the kubeadm e2e tests use to bring up a cluster), and now have kubernetes/test-infra#2347 in review to take advantage of it. My local testing shows that this allows us to turn Conformance testing back on and have a completely green pass.

pipejakob (Contributor Author)

@caseydavenport Sorry I missed your message in this issue; it was right around the CNI fire drill.

I like the suggestion around using something like bridge for e2e tests, but I don't think I would completely ditch other scenarios that exercise third-party CNI providers for it. It's probably a mistake for this issue to refer to "the" kubeadm e2e test, or to imply in any other way that it stands alone, since this is just the first of its kind. I will say that the great thing about this first test is that it exercises the actual instructions we give users in our documentation. I haven't heard of any real users running bridge with kubeadm, and we certainly don't have instructions for it, so it's not that kubeadm+bridge working is important in itself; the point would be to remove the uncertainty of an arbitrary third-party provider breaking.

However, I don't think this first e2e will or should be considered the gold standard of whether or not kubeadm is broken, but just one signal. I would love to add other e2e scenarios that duplicate the setup but exercise other third-party CNI providers that we advertise in the official documentation. If a kubeadm commit causes them all to fail simultaneously, that's a pretty good signal that we broke kubeadm itself. If a commit causes only some of them to fail, then maybe we've uncovered an underspecified contract, or just proactively helped find a bug in those specific providers. This is all predicated on the fact that these jobs aren't blocking PRs, but providing clues to possible regressions over time. When we get closer to having a PR-blocking job (if that even happens), there should be another discussion and consensus to decide on the minimal scenario we consider indicative of kubeadm's health, since some people consider a flaky PR-blocking job worse than no job at all.

In the meantime, I definitely see incremental value in adding another e2e job using bridge (or something like it) as another signal. Would you mind opening a separate issue to track that? I think it's a meaty enough topic to warrant a dedicated issue for discussion / prototyping and I'd like to get your input on the best strategy of configuring something capable of passing Conformance testing.

caseydavenport (Member)

@pipejakob thanks for the response :)

I think what you've said above makes a lot of sense, and testing a number of providers will likely give us better signal than just one. I've opened #218 to discuss further.

pipejakob (Contributor Author)

There was enough fire-fighting at the end of last week around the 1.6.1 fixes that I forgot to mention a few updates on this issue:

klizhentas commented May 11, 2017

@pipejakob I'm working on issue #218 and have been digging into the current test infra for kubeadm. As a result I wrote this guide; I can put it in the right place once I figure out where that is. Comments are appreciated :)

Hacking on Kubeadm e2e tests

Set up Tools

GOPATH

This guide uses GOPATH, but you are welcome to use any other directory.

Set up your GOPATH as described here:

https://golang.org/doc/code.html#GOPATH

Install jsonnet

git clone git@github.com:google/jsonnet.git
cd jsonnet
make
# if you don't have GOPATH set, use an alternative location
cp jsonnet $GOPATH/bin

Install JQ

On Ubuntu/Debian:

sudo apt-get -y install jq

Install Terraform

Terraform must be version 0.7.2; otherwise the e2e tests won't work.

curl -o /tmp/terraform_0.7.2_linux_amd64.zip https://releases.hashicorp.com/terraform/0.7.2/terraform_0.7.2_linux_amd64.zip
cd $GOPATH/bin
unzip /tmp/terraform_0.7.2_linux_amd64.zip
rm -f /tmp/terraform_0.7.2_linux_amd64.zip
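
To double-check that the pinned binary is the one being picked up (assuming $GOPATH/bin is on your PATH):

terraform version   # should report Terraform v0.7.2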

Set up your GCE environment

Install the Google Cloud SDK

Read the guide here: https://cloud.google.com/sdk/downloads

Clone Repos

mkdir -p $GOPATH/src/github.com/kubernetes
cd $GOPATH/src/github.com/kubernetes

git clone git@github.com:kubernetes/kubernetes.git
git clone git@github.com:kubernetes/test-infra.git
go install github.com/kubernetes/test-infra/kubetest

# note that the pipejakob fork is a temporary fix for an issue in kubernetes-anywhere
mkdir -p $GOPATH/src/github.com/pipejakob
cd $GOPATH/src/github.com/pipejakob
git clone https://github.com/pipejakob/kubernetes-anywhere.git

Create bucket

gcloud auth login

Then go through the IAM setup described here:

https://github.com/kubernetes/kubernetes-anywhere/tree/master/phase1/gce

$ cd $GOPATH/src/github.com/pipejakob/kubernetes-anywhere
$ export PROJECT_ID=<my-project>
$ export PROJECT=<my-project>
$ export SERVICE_ACCOUNT="kubernetes-anywhere@${PROJECT_ID}.iam.gserviceaccount.com"
$ gcloud iam service-accounts create kubernetes-anywhere \
    --display-name kubernetes-anywhere
$ gcloud iam service-accounts keys create phase1/gce/account.json \
    --iam-account "${SERVICE_ACCOUNT}"
$ gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member "serviceAccount:${SERVICE_ACCOUNT}" --role roles/editor
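
Depending on how your gcloud credentials are set up, you may also need to point the tooling at the key file you just created, as in pipejakob's hack/e2e.go example earlier in this thread (this step is an assumption, not something the original guide included):

$ export GOOGLE_APPLICATION_CREDENTIALS=$PWD/phase1/gce/account.json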

Generate SSH Key

ssh-keygen -t rsa -f ~/.ssh/google_compute_engine

Launch e2e

cd $GOPATH/src/github.com/kubernetes/kubernetes
kubetest -v --deployment=kubernetes-anywhere --kubernetes-anywhere-path ${GOPATH}/src/github.com/pipejakob/kubernetes-anywhere --kubernetes-anywhere-phase2-provider kubeadm --kubernetes-anywhere-cluster my-e2e-test --up --test --down
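
To run only the Conformance subset (what the CI job discussed earlier in this issue exercises), you can pass a ginkgo focus through kubetest's --test_args pass-through; a sketch, assuming the same flags as above:

kubetest -v --deployment=kubernetes-anywhere --kubernetes-anywhere-path ${GOPATH}/src/github.com/pipejakob/kubernetes-anywhere --kubernetes-anywhere-phase2-provider kubeadm --kubernetes-anywhere-cluster my-e2e-test --up --test --down --test_args="--ginkgo.focus=\[Conformance\]"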

Troubleshooting

Instance account failure

If you are seeing the error:

Error applying plan:

1 error(s) occurred:

* google_compute_instance_group_manager.my-e2e-test-node-group: The service account '<>@cloudservices.gserviceaccount.com' is not associated with the project.


Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Re-run gcloud init and set your default zone to us-central1-b.

I have no idea why this is the case, but it works :/

Failure to spin up weave

You need to make sure that your local kubectl is the same version as the cluster you are testing; otherwise you can get errors on the client like:

unable to decode ".tmp/weave-net-cluster-role-binding.json": no kind "ClusterRoleBinding" is registered for version "rbac.authorization.k8s.io/v1beta1"
unable to decode ".tmp/weave-net-cluster-role.json": no kind "ClusterRole" is registered for version "rbac.authorization.k8s.io/v1beta1"
Makefile:57: recipe for target 'addons' failed
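
A quick way to spot the skew before it bites (sketch; the version numbers below are just examples):

kubectl version --short
# Client Version: v1.5.x
# Server Version: v1.6.x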

klizhentas

I'm also getting errors running the weave tests, btw:

unable to decode ".tmp/weave-net-cluster-role-binding.json": no kind "ClusterRoleBinding" is registered for version "rbac.authorization.k8s.io/v1beta1"
unable to decode ".tmp/weave-net-cluster-role.json": no kind "ClusterRole" is registered for version "rbac.authorization.k8s.io/v1beta1"
Makefile:57: recipe for target 'addons' failed

luxas (Member) commented May 12, 2017

@klizhentas You should use git.io/weave-kube for v1.5 and git.io/weave-kube-1.6 for v1.6

klizhentas

@luxas that's actually because my local kubectl is not the same version as the cluster being tested

luxas (Member) commented May 29, 2017

@klizhentas @pipejakob Do you have a status update on this issue?

klizhentas commented May 29, 2017

I believe

are good to go; the last bit is to add the actual Jenkins job

luxas (Member) commented Sep 15, 2017

@pipejakob Most things here are now fixed. Does it still make sense to keep this open or should we open more specialized issues?

pipejakob (Contributor Author)

@luxas Agreed, this issue is extremely long and hasn't been kept up to date to track the real state of the world. I'm in favor of closing, and we can open individual issues to track significant remaining work.
