
Investigate using kind for e2e tests on Prow #103

Closed
adrcunha opened this issue Sep 5, 2018 · 22 comments

@adrcunha
Contributor

adrcunha commented Sep 5, 2018

kind (Kubernetes in Docker) might make e2e tests faster, since there's no need to create an external cluster. However, we need to take into consideration that we have to start a k8s cluster with a known, public version (e.g. 1.11.1).
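For illustration, pinning the cluster to a known version looks roughly like this (the cluster name and node image tag are assumptions; which tags are published depends on the kind release):

    # Minimal sketch: create a kind cluster pinned to a known k8s version.
    # The image tag is illustrative, not a guaranteed published tag.
    kind create cluster --name knative-test --image kindest/node:v1.11.1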

@adrcunha
Contributor Author

/cc @mattmoor @josephburnett FYI

This should help with the question brought up today during Joe's presentation about autoscaling.

@mattmoor
Member

I'm wondering about the resource requirements we'd need and its ability to handle our workloads (e.g. provisioning GCP LoadBalancers).

@chaodaiG
Contributor

I investigated using KIND in our e2e test flows by trying to make all the e2e tests in knative/serving pass on KIND. After a couple of PRs (the ones associated with this issue), KIND works for most of the tests, as shown below:

  • test/e2e-upgrade: 2/2 passed
  • test/e2e: 17/17 passed
  • test/scale: 0/2 passed (they would pass if given more time, e.g. 5 minutes to scale to 10 and 20 minutes to scale to 50; this is unlikely to be a resource limit issue, as KIND doesn't impose one, and these tests ran on a Linux box with 12 CPUs and 64 GB of memory)
  • test/conformance: 56/56 passed (1 test only passes with swap turned off)
  • test/e2e/build: 3/6 passed (the build-pipeline tests failed)

To make KIND work, the following steps were needed (a rough sketch in shell follows the list):

  • Build the node image from k8s.io/kubernetes for the specific Kubernetes version, as the default Kubernetes version is well ahead of what we test on GKE
  • Pass the cluster name and the special KIND kubeconfig file to every e2e test
  • Figure out the IP address from Istio and pass it to the e2e tests
  • Turn “swap” off
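Roughly, in shell (the cluster name, Kubernetes version, and Istio service/namespace are illustrative assumptions; exact kind flags vary by release, e.g. kind get kubeconfig-path only exists in the 0.x releases current at the time):

    # Sketch only; names and versions are illustrative, not prescriptive.
    # 1. Build a node image for the Kubernetes version under test
    #    (older kind releases build from a local k8s.io/kubernetes checkout).
    cd "$GOPATH/src/k8s.io/kubernetes" && git checkout v1.12.7
    kind build node-image
    # 2. Create the cluster and point the e2e tests at its kubeconfig.
    kind create cluster --name knative-e2e
    export KUBECONFIG="$(kind get kubeconfig-path --name knative-e2e)"
    # 3. Find the ingress address to hand to the tests; kind has no cloud
    #    LoadBalancer, so this is typically a node IP plus a NodePort.
    kubectl get svc istio-ingressgateway -n istio-system -o wide
    # 4. Turn swap off on the host running the tests.
    sudo swapoff -a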

Overall, KIND is not ready yet for running our e2e tests: the scaling test duration is not comparable to GKE clusters, and that outweighs the potential benefit it could bring us (shorter cluster creation time, from ~3 minutes to ~10-20 seconds). Also, KIND is not in a stable state yet (the master branch is currently broken, kubernetes-sigs/kind#509), and there is a breaking change in the upcoming 0.3 release branch.

@chaodaiG
Contributor

@adrcunha fyi

@BenTheElder

BenTheElder commented May 15, 2019

👋 kind dev here, would like to help if there's interest 🙃

test/scale: 0/2 passed (they would pass if given more time, e.g. 5 minutes to scale to 10 and 20 minutes to scale to 50; this is unlikely to be a resource limit issue, as KIND doesn't impose one, and these tests ran on a Linux box with 12 CPUs and 64 GB of memory)

Do you always need to run scale tests? These are necessarily going to be bound by the hardware they run on and might improve if run on bigger CI nodes; I'd guess your GKE clusters have more horsepower. Are you running scale tests in presubmit?

test/e2e/build: 3/6 passed (the build-pipeline tests failed)

I'd be curious what those failed tests are doing, if you know :-)

Build the node image from k8s.io/kubernetes for the specific Kubernetes version, as the default Kubernetes version is well ahead of what we test on GKE

What version do you need? kubernetes-sigs/kind#531

Does knative not test with the current Kubernetes release at all? This seems surprising.

Turn “swap” off

This should only be necessary if you need memory limits on your pods; unfortunately, there aren't many options there.

Also, KIND is not in a stable state yet (the master branch is currently broken, kubernetes-sigs/kind#509), and there is a breaking change in the upcoming 0.3 release branch.

To clarify, it is not broken; you must build with Go modules or use the Makefile.
As for stability, sure, but the Kubernetes master branch is not stable either. Using a particular kind release should be stable (e.g. 0.2.1). Please do not install from HEAD in your CI.

https://github.com/kubernetes-sigs/kind/releases has precompiled binaries to make this easier.
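For example, a CI install pinned to a release could look like this (the tag and asset name are assumptions; check the releases page for the exact names):

    # Hedged sketch: install a pinned kind release in CI instead of HEAD.
    # Tag and asset name are assumptions; verify them on the releases page.
    KIND_VERSION=0.2.1
    curl -Lo kind "https://github.com/kubernetes-sigs/kind/releases/download/${KIND_VERSION}/kind-linux-amd64"
    chmod +x kind && sudo mv kind /usr/local/bin/
    kind version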

0.3 will require new node images (but we're providing them) and will change some internal details of the node that were never supposed to be guaranteed (like which CRI we use). I wouldn't expect knative testing to depend on these details.

@chaodaiG
Contributor

Hi @BenTheElder, thanks for the prompt response and the clarification. Yes, we do run the scale tests in presubmit; the difference is 240-300 seconds on KIND vs. 10-20 seconds on GKE for scaling up to 10. Our e2e tests run on clusters of 4 n1-standard-4 nodes (4 vCPUs and 15 GB of memory each), which isn't much different from my Linux box (12 CPUs and 64 GB of memory).
In terms of Kubernetes version, we need the latest GKE version for consistency reasons.

@adrcunha
Contributor Author

Do you always need to run scale tests? These are necessarily going to be bound by the hardware they run on and might improve if run on bigger CI nodes; I'd guess your GKE clusters have more horsepower. Are you running scale tests in presubmit?

At this point we're running them on presubmit, especially because there's active development in the autoscaling area.

Does knative not test with the current Kubernetes release at all? This seems surprising.

No, we don't test against the latest k8s version for several reasons, including features and compatibility with other k8s providers (GCP, IBM, Azure, etc.). Running against the latest k8s isn't currently a concern.

@BenTheElder

BenTheElder commented May 15, 2019

Thanks @chaodaiG @adrcunha that makes sense to me.

Yes, we do run the scale tests in presubmit; the difference is 240-300 seconds on KIND vs. 10-20 seconds on GKE for scaling up to 10. Our e2e tests run on clusters of 4 n1-standard-4 nodes (4 vCPUs and 15 GB of memory each), which isn't much different from my Linux box (12 CPUs and 64 GB of memory).

So pod start time is the issue? That is good to know and worth looking into on my end...

No, we don't test against the latest k8s version for several reasons, including features and compatibility with other k8s providers (GCP, IBM, Azure, etc.).

I see, so if I want to run knative locally I need to match the cloud providers?
Out of curiosity, what version exactly are you testing?

Running against the latest k8s isn't currently a concern.

FWIW, it looks like IBM has 1.14.1 with the default at 1.13.6, which is pretty recent (the latest from the two most recent Kubernetes branches). AKS has 1.13 GA with 1.14 in preview. GKE supports 1.13.5, and of course customers on all of these clouds run unmanaged clusters with more recent versions.

kind defaults to 1.14.1 currently, but we run CI against all supported Kubernetes versions: https://testgrid.k8s.io/conformance-kind

@chaodaiG
Contributor

Closing this investigation for now; we'll re-evaluate in the future.

FYI @BenTheElder, based on my observations, pod start time is the bottleneck. Thanks a lot for the help and the quick turnaround throughout my investigation.

Also FYI @tcnghia, we decided not to integrate with KIND at this point. Thank you for helping me sweep out most of the failing tests.

/close

@knative-prow-robot
Collaborator

@chaodaiG: Closing this issue.

In response to this:

Closing this investigation for now; we'll re-evaluate in the future.

FYI @BenTheElder, based on my observations, pod start time is the bottleneck. Thanks a lot for the help and the quick turnaround throughout my investigation.

Also FYI @tcnghia, we decided not to integrate with KIND at this point. Thank you for helping me sweep out most of the failing tests.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@adrcunha
Contributor Author

I see, so if I want to run knative locally I need to match the cloud providers?

For stability, yes. The knative.dev website has instructions for several cloud providers, as well as for Minikube.

Out of curiosity, what version exactly are you testing?

CI and presubmit run against latest GKE.

kind defaults to 1.14.1 currently, but we run CI against all supported Kubernetes versions

Good to know, as it might be necessary in the future, thanks.

@chizhg
Member

chizhg commented Apr 2, 2020

/reopen

KIND can help us get a better understanding of the cluster state, since it can dump the cluster logs to a folder (e.g. with kind export logs) under $ARTIFACTS in Prow, which could be super helpful for developers fighting test flakiness. So we should re-evaluate using KIND to run integration tests for Knative, at least for some tests.
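A minimal sketch of what that could look like in a Prow job (the cluster name is an illustrative assumption; $ARTIFACTS is the directory Prow uploads with the job results):

    # Dump the kind cluster's logs into Prow's artifact directory so they
    # get uploaded with the job results. The cluster name is illustrative.
    kind export logs "${ARTIFACTS}/kind" --name knative-e2e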

@knative-prow-robot
Collaborator

@chizhg: Reopened this issue.

In response to this:

/reopen

KIND can help us get a better understanding of the cluster state, since it can dump the cluster logs to a folder (e.g. with kind export logs) under $ARTIFACTS in Prow, which could be super helpful for developers fighting test flakiness. So we should re-evaluate using KIND to run integration tests for Knative, at least for some tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chaodaiG
Contributor

chaodaiG commented Apr 2, 2020

You'll also need to consider incorporating it with kntest, I believe.

@mattmoor
Member

mattmoor commented Apr 2, 2020

I'd love to see us try this because frankly it's going to be our most reliable way of tracking the set of upstream Kubernetes versions that we want to validate against.

Technically we should be testing against 1.16-1.18 right now, but GKE only has 1.15 in the default channel.

Getting some base set of tests (e.g. conformance) to run against kind across those versions would be pretty valuable IMO.
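A sketch of such a version matrix, assuming pinned node images per version (the image tags and the test target are illustrative assumptions):

    # Run a base test suite against several Kubernetes versions via kind.
    # Image tags and the test target are illustrative assumptions.
    for v in v1.16.9 v1.17.5 v1.18.2; do
      name="conf-${v//./-}"            # avoid dots in the cluster name
      kind create cluster --name "${name}" --image "kindest/node:${v}"
      go test ./test/conformance/...   # hypothetical conformance target
      kind delete cluster --name "${name}"
    done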

@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot added the lifecycle/stale label on Jul 2, 2020
@chizhg
Member

chizhg commented Jul 2, 2020

/remove-lifecycle stale

@knative-prow-robot removed the lifecycle/stale label on Jul 2, 2020
@coryrc
Contributor

coryrc commented Sep 15, 2020

/assign @mattmoor
apparently :-)

@mattmoor
Member

I'm not looking at this in Prow
/unassign

@coryrc
Contributor

coryrc commented Sep 16, 2020

Ah, I glossed over the "on Prow" part, but I figured the "using KIND" part was more important :)

@chaodaiG
Contributor

Enabled in #2427
/close

@knative-prow-robot
Collaborator

@chaodaiG: Closing this issue.

In response to this:

Enabled in #2427
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
