kubeadm e2e testing #190
@pipejakob I'll take ownership of kubeadm-specific e2e tests.
@dmmcquay Awesome. I knew you wanted to own that area, I just wasn't sure what the full testplan was (the sublist on the next line). Feel free to update it with the scenarios you hope to exercise.
@jbeda should also craft some e2e tests for the BootstrapSigner and TokenCleaner.
I've been tackling the first one of these (e2e Conformance tests) by adding kubeadm support to kubernetes-anywhere, and support for kubernetes-anywhere as a deployment option to hack/e2e.go (which is now being migrated to test-infra/kubetest). You can bring up, test, and tear down a GCE kubeadm cluster via something like:
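The exact command was lost from this comment; a sketch of the invocation as it appears later in this thread (the cluster name and checkout paths are illustrative):

```shell
# Assumes kubetest is on your PATH and pipejakob/kubernetes-anywhere
# is checked out under your GOPATH.
cd $GOPATH/src/github.com/kubernetes/kubernetes
kubetest -v --deployment=kubernetes-anywhere \
  --kubernetes-anywhere-path ${GOPATH}/src/github.com/pipejakob/kubernetes-anywhere \
  --kubernetes-anywhere-phase2-provider kubeadm \
  --kubernetes-anywhere-cluster my-e2e-test \
  --up --test --down
```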
You can also specify additional options to customize the run. This is all exercised in the test-infra CI job that prow runs; it uses a custom Docker image and runner. I'll translate this into a more comprehensive document in the repo so it's easier to discover and get started testing. I just wanted to add this comment so anyone else has the pointers necessary to get started and poke around until a more complete document is written.
@luxas or anyone else: feel free to add directly to the checklist in the description for scenarios you think we should test for regressions. After a round of brainstorming, I'll drive trying to get SIG consensus around what deliverables belong to what milestones and clean up the list.
@pipejakob Thanks! I added some bullets to the list. A document describing this flow in this repo will be awesome.
A few days ago, I sent out a bundle of PRs to get the CI kubeadm e2e test back to green (kubernetes/test-infra#2179, kubernetes/test-infra#2180, kubernetes-retired/kubernetes-anywhere#352, kubernetes-retired/kubernetes-anywhere#353, kubernetes/test-infra#2182, kubernetes/test-infra#2184, kubernetes/test-infra#2183), and most have been merged now. My testing also found a kubeadm bug I was able to fix. Now, the latest issue I'm trying to work through is getting a pod network working on a 1.6 cluster. Weave Net fails, seemingly because the insecure port (8080) is no longer available.

@luxas @dmmcquay If either of you have bandwidth to figure out a reliable pod network to install on 1.6 clusters (or know of one already), that would be very helpful. I'll continue to debug myself in the meantime.
@pipejakob Weave doesn't work due to the RBAC enablement, not the 8080 thing directly.
Ah. I had just tried the old install instructions.
@pipejakob It does work with v1.6.
@luxas So, the weave-net pods seem to come up fine, but the kube-dns pod remains stuck and never becomes ready.
I've gotten the kubeadm e2e CI job back to green: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-gce/480/ The actual Conformance tests are still disabled due to the kube-dns problems I was having, which should be fixed in beta.3. I'll verify a successful e2e run and re-enable them, then create a new prow entry to run the job during pulls as well.
No luck with beta.3. Let me elaborate on my setup and the problem I'm seeing, because I think the other issue @luxas brought up might have been a red herring. I'm using 5 GCE hosts: 1 master and 4 nodes. On the master:
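The init command itself was elided; presumably the standard 1.6 flow, something like (the version flag is an assumption):

```shell
# Hedged reconstruction -- the original command was lost from this comment.
kubeadm init --kubernetes-version v1.6.0
```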
Then, I have the nodes join, each with:
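The join command was also elided; presumably the standard token-based join, where `<token>` and `<master-ip>` stand in for the real values:

```shell
# Hedged reconstruction -- placeholders, not the actual values used.
kubeadm join --token <token> <master-ip>:6443
```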
At this point, everything looks successful, except that the kube-dns pod never starts running.
If I describe the pod, I see:
The odd things that stick out to me are that the weave-net pods run on every node (but not the master), and have successfully created the CNI config files on those nodes. I'm still new to the CNI architecture, and am not sure what the expectations are between the network provider and the kubelet, so I'm not sure whether it's wrong that kube-dns is trying to run on the master instead of one of the nodes, or wrong that there is no weave-net pod running on the master to create those files.
@pipejakob @luxas There is no CNI configuration on the node, so the kubelet failing to set up/tear down the network seems OK to me. It sounds to me like the weave-net pods should also run on the master to populate the files. /cc @kubernetes/sig-network-misc
@luxas Since your setup gets you further than mine, can you confirm that either you have CNI configured on the master, or kube-dns running on a non-master node? I wanted to confirm which direction I should try to take the fix.
@pipejakob Yes, I got past that issue and will have a fix for it soon, but it seems like we're standing in front of a third issue: weaveworks/weave#2850 (a CNI breaking change; at least Weave is affected). Will ping you here as soon as I've updated my manifest so you can try again.
@luxas Thanks!
Wow -- CNI breaking change? That is going to be a huge problem. |
@jbeda you can continue using the older CNI code in the plugin if you want to; Kubernetes will still work with that version. But by vendoring in the new CNI code, you opt into the new features the 0.5.0 release provides. It's not a "huge problem": you either opt in or you don't. But if you opt in, obviously some things change.
Also, even though Kubernetes does not yet understand the new CNI spec return format, the weave plugin should be handling that OK by returning the result in the format that Kubernetes expects (CNI spec version 0.2.0), as long as the CNI network configuration JSON does not set a "cniVersion" greater than 0.2.0. Is that the case? If the config JSON sets "cniVersion" 0.3.0 or higher, then that's a misconfiguration and yes, kube will fail to interpret the result.
A bit out of touch from this discussion, so maybe this has already been discussed / solved: the various discussions I've seen here and in #sig-network have led me to wonder what we can be doing to decouple this e2e testing from third-party CNI manifests which may or may not be stable. I think we should be considering what we can do to use upstream CNI plugins (e.g. bridge). This certainly becomes easier when kubenet is itself a CNI plugin like any other.
Both weave and calico don't set the "cniVersion" field in their network configs. If this is really the problem, a sane thing to do now would be to have "no version" mean 0.2 with a warning. If users want 0.3 they should have to set it explicitly. At some point in the future, deprecate 0.2 and make sure everyone is updated. Note that calico isn't complaining about the env variable, but the kubelet still logs similar errors.
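For illustration, this is roughly where "cniVersion" sits in a CNI network config; the name and type values below are placeholders, not the actual contents of either project's manifest:

```json
{
    "cniVersion": "0.2.0",
    "name": "example-net",
    "type": "example-plugin"
}
```

Omitting "cniVersion" entirely is what triggers the ambiguity discussed above; pinning it to 0.2.0 keeps the result format Kubernetes expects.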
Sent out a few more PRs to run kubeadm e2es against PRs and fix a race condition in them that's causing flakes. |
I confirmed that the newest weave-net manifest fixes all of the issues I had been seeing. I'm running the Conformance tests locally now, and will report any findings. If they're green, I'll re-enable them in the CI/pull jobs as well.
I just had kubernetes-retired/kubernetes-anywhere#363 merged to add support for weave-net to kubernetes-anywhere (which the kubeadm e2e tests use to bring up a cluster), and now have kubernetes/test-infra#2347 in review to take advantage of it. My local testing shows that this allows us to turn Conformance testing back on and have a completely green pass. |
@caseydavenport Sorry I missed your message in this issue; it was right around the CNI firedrill. I like the suggestion of using something like bridge for e2e tests, but I don't think I would completely ditch other scenarios that exercise third-party CNI providers for it.

It's probably a mistake for this issue to refer to "the" kubeadm e2e test, or any other implication that it stands alone, since this is just the first of its kind. The great thing about this first test is that it exercises the actual instructions we give users in our documentation. I haven't heard of any real users using bridge with kubeadm, and we certainly don't have instructions for it, so a kubeadm+bridge job wouldn't be testing something users depend on; its value would be in removing the uncertainty of an arbitrary third-party provider breaking. However, I don't think this first e2e will or should be considered the gold standard of whether or not kubeadm is broken; it's just one signal.

I would love to add other e2e scenarios that duplicate the setup but exercise other third-party CNI providers that we advertise in the official documentation. If a kubeadm commit causes them all to fail simultaneously, that's a pretty good signal that we broke kubeadm itself. If a commit causes only some of them to fail, then maybe we've uncovered an underspecified contract, or just proactively helped find a bug in those specific providers.

This is all predicated on the fact that these jobs aren't blocking PRs, but providing clues to possible regressions over time. When we get closer to having a PR-blocking job (if that ever happens), there should be another discussion and consensus to decide the minimal scenario we consider indicative of kubeadm's health, since some people consider a flaky PR-blocking job worse than no job at all. In the meantime, I definitely see incremental value in adding another e2e job using bridge (or something like it) as another signal.
Would you mind opening a separate issue to track that? I think it's a meaty enough topic to warrant a dedicated issue for discussion / prototyping, and I'd like to get your input on the best strategy for configuring something capable of passing Conformance testing.
@pipejakob thanks for the response :) I think what you've said above makes a lot of sense, and testing a number of providers will likely give us better signal than just one. I've opened #218 to discuss further.
There was enough fire-fighting at the end of last week around the 1.6.1 fixes that I forgot to mention a few updates on this issue.
@pipejakob I'm working on issue #218 and was digging into the current test infra for kubeadm. As a result I wrote this guide; I can put it in the right place once I figure out where that is. Comments are appreciated :)

# Hacking on kubeadm e2e tests

## Set up tools

### GOPATH

This guide uses GOPATH, but you are welcome to switch to any other directory. Set up your GOPATH as described here: https://golang.org/doc/code.html#GOPATH

### Install jsonnet
### Install jq

On Ubuntu/Debian:
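The install command was elided; on Ubuntu/Debian it is presumably the packaged version:

```shell
sudo apt-get install -y jq
```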
### Install Terraform

Terraform must be version 0.7.2, otherwise the e2e tests won't work.
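One way to pin exactly 0.7.2 (URL assumed to follow HashiCorp's standard release layout; adjust the platform suffix for your machine):

```shell
wget https://releases.hashicorp.com/terraform/0.7.2/terraform_0.7.2_linux_amd64.zip
unzip terraform_0.7.2_linux_amd64.zip
sudo mv terraform /usr/local/bin/
terraform version
```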
## Set up your GCE environment

### Install the GCE SDK

Read the guide here: https://cloud.google.com/sdk/downloads

### Clone repos
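The clone commands were elided; based on the checkout paths used elsewhere in this guide, presumably:

```shell
# Assumed: the later commands in this guide cd into these paths.
mkdir -p $GOPATH/src/github.com/pipejakob $GOPATH/src/github.com/kubernetes
git clone https://github.com/pipejakob/kubernetes-anywhere $GOPATH/src/github.com/pipejakob/kubernetes-anywhere
git clone https://github.com/kubernetes/kubernetes $GOPATH/src/github.com/kubernetes/kubernetes
```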
### Create bucket

```shell
gcloud auth login
```

See https://github.com/kubernetes/kubernetes-anywhere/tree/master/phase1/gce

```shell
cd $GOPATH/src/github.com/pipejakob/kubernetes-anywhere
export PROJECT_ID=<my-project>
export PROJECT=<my-project>
export SERVICE_ACCOUNT="kubernetes-anywhere@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud iam service-accounts create kubernetes-anywhere \
  --display-name kubernetes-anywhere
gcloud iam service-accounts keys create phase1/gce/account.json \
  --iam-account "${SERVICE_ACCOUNT}"
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${SERVICE_ACCOUNT}" --role roles/editor
```

### Generate SSH key

```shell
ssh-keygen -t rsa -f ~/.ssh/google_compute_engine
```

## Launch e2e

```shell
cd $GOPATH/src/github.com/kubernetes/kubernetes
```
```shell
kubetest -v --deployment=kubernetes-anywhere \
  --kubernetes-anywhere-path ${GOPATH}/src/github.com/pipejakob/kubernetes-anywhere \
  --kubernetes-anywhere-phase2-provider kubeadm \
  --kubernetes-anywhere-cluster my-e2e-test \
  --up --test --down
```

## Troubleshooting

### Instance account failure

If you are seeing the error:
Re-run the command. I have no idea why this is the case, but it works :/

### Failure to spin up weave

You need to make sure that your local kubectl is the same version as the cluster you are testing, or you can get errors on the client:
I'm also getting errors running the weave tests, btw.
@klizhentas You should use
@luxas that's actually because my local kubectl is not the same version as the cluster being tested.
@klizhentas @pipejakob Do you have a status update on this issue?
I believe
are good to go; the last bit is to add the actual Jenkins job.
@pipejakob Most things here are now fixed. Does it still make sense to keep this open, or should we open more specialized issues?
@luxas Agreed, this issue is extremely long and hasn't been kept up to date to track the real state of the world. I'm in favor of closing, and we can open individual issues to track significant remaining work.
This issue is to break out and track work for the opaque "e2e testing" subtask of #63.
This is a work-in-progress, and still needs milestones defined for what we need for kubeadm beta vs. GA.
- `kubeadm phase` commands do exactly what they should and work well together

CC @kubernetes/sig-cluster-lifecycle-misc