Failing tests for 0.4 and 1.0 #5952

Closed
fabriziopandini opened this issue Jan 19, 2022 · 19 comments · Fixed by #5986
Assignees
Labels
area/testing Issues or PRs related to testing kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/release-blocking Issues or PRs that need to be closed before the next CAPI release
Milestone

Comments

@fabriziopandini
Member

The following E2E tests for 0.4 seem to be failing consistently:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-0.4#capi-e2e-release-0-4-1-22-1-23
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-0.4#capi-e2e-release-0-4-1-23-latest

/kind failing-test
/milestone v0.4
/kind release-blocking
/area testing

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jan 19, 2022
@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Jan 19, 2022
@k8s-ci-robot k8s-ci-robot added kind/release-blocking Issues or PRs that need to be closed before the next CAPI release area/testing Issues or PRs related to testing labels Jan 19, 2022
@sbueringer
Member

Based on KCP controller logs it looks like the CoreDNS migration lib we're using only supports older CoreDNS versions: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-22-1-23-release-0-4/1483442871951429632/artifacts/clusters/bootstrap/controllers/capi-kubeadm-control-plane-controller-manager/capi-kubeadm-control-plane-controller-manager-744575bddc-8jfdj/manager.log

Let's:

  • use older CoreDNS versions compatible with KCP in our e2e tests
  • document which CoreDNS versions we support / we're testing in the book

@killianmuldoon
Contributor

/assign

Should the doc change be only for the 0.4 section of the book? I'll create a PR against test-infra now to change the CoreDNS upgrade target to 1.8.4.

@sbueringer
Member

sbueringer commented Jan 19, 2022

@killianmuldoon Do you know which versions are supported by the CoreDNS lib?

Some context about the upgrade test:

  • we're creating the cluster with kubeadm of the source version (i.e. the CoreDNS version of the source version is deployed)
  • during the upgrade we then have the jobs configured to upgrade to the CoreDNS version which the kubeadm of the target version is using

We should now change the jobs so that the configured CoreDNS version is as close as possible to the target kubeadm's CoreDNS version (depending on which versions the CoreDNS migration lib supports). The worst case is that the "target" CoreDNS version (the one configured in the job) is the same as the one the source kubeadm is using.

An example:

Usually we would upgrade to v1.8.6 in the upgrade test. Depending on the supported range of the CoreDNS lib we should upgrade to v1.8.5 or v1.8.4 instead. We should never try to downgrade CoreDNS.

@sbueringer
Member

We currently assume that the CoreDNS migration lib doesn't fail when a cluster is created with CoreDNS v1.8.6 (1.23=>latest job) and CoreDNS is "upgraded" to v1.8.6 because then the migration shouldn't be executed.

@killianmuldoon
Contributor

killianmuldoon commented Jan 19, 2022

The supported versions are listed here: https://github.com/coredns/corefile-migration/blob/v1.0.13/migration/versions.go

The corefile-migration version used by each CAPI release, with the maximum CoreDNS version it supports:

  • 0.3: v1.0.12 (max 1.8.4)
  • 0.4: v1.0.12 (max 1.8.4)
  • 1.0: v1.0.13 (max 1.8.5)
  • 1.1: v1.0.14 (max 1.8.6)

@sbueringer
Member

sbueringer commented Jan 19, 2022

Some more data

  • kubeadm => CoreDNS versions
    • 1.21 => v1.8.0
    • 1.22 => v1.8.4
    • 1.23 => v1.8.6
    • latest => v1.8.6

Okay I guess now we only have to calculate the maximum supported version for each job and we have it :)
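
The calculation described above can be sketched roughly as follows. This is only an illustration built from the two tables in this thread (kubeadm => CoreDNS versions, and the maximum CoreDNS version per CAPI release); the names `KUBEADM_COREDNS`, `CAPI_MAX_COREDNS`, `parse`, and `upgrade_target` are hypothetical and do not come from CAPI or test-infra.

```python
# Data copied from the comments in this thread; all names are hypothetical.

# CoreDNS version that each kubeadm version deploys by default.
KUBEADM_COREDNS = {
    "1.21": "v1.8.0",
    "1.22": "v1.8.4",
    "1.23": "v1.8.6",
    "latest": "v1.8.6",
}

# Highest CoreDNS version supported by the migration lib in each CAPI release.
CAPI_MAX_COREDNS = {
    "0.3": "v1.8.4",
    "0.4": "v1.8.4",
    "1.0": "v1.8.5",
    "1.1": "v1.8.6",
}


def parse(version: str) -> tuple:
    """Turn 'v1.8.4' into (1, 8, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def upgrade_target(capi: str, source_k8s: str, target_k8s: str) -> str:
    """Pick the CoreDNS version an upgrade job should request."""
    source = KUBEADM_COREDNS[source_k8s]
    wanted = KUBEADM_COREDNS[target_k8s]
    lib_max = CAPI_MAX_COREDNS[capi]
    # Go as high as the migration lib allows, but not above kubeadm's default.
    capped = min(wanted, lib_max, key=parse)
    # Never downgrade: keep the source version if the cap falls below it.
    return max(capped, source, key=parse)


print(upgrade_target("0.4", "1.22", "1.23"))  # v1.8.4
print(upgrade_target("1.0", "1.22", "1.23"))  # v1.8.5
```

For the 1.23=>latest jobs the sketch returns v1.8.6 for every release, because the source cluster already runs v1.8.6 and a downgrade is ruled out; for 0.4 and 1.0 that is above the range the migration lib supports.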

@sbueringer
Member

Btw v1.0 jobs are failing too as expected based on the data: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.0#capi-e2e-release-1-0-1-22-1-23

@killianmuldoon
Contributor

Yeah - I'm updating that to 1.8.5

@sbueringer
Member

sbueringer commented Jan 19, 2022

> We currently assume that the CoreDNS migration lib doesn't fail when a cluster is created with CoreDNS v1.8.6 (1.23=>latest job) and CoreDNS is "upgraded" to v1.8.6 because then the migration shouldn't be executed.

That assumption was wrong, even v1.8.6 => v1.8.6 fails: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-23-latest-release-1-0/1483442872060481536/artifacts/clusters/bootstrap/controllers/capi-kubeadm-control-plane-controller-manager/capi-kubeadm-control-plane-controller-manager-744575bddc-qwdqp/manager.log

In my opinion we can and should fix this in KCP (and backport it), so that KCP only runs the migration tool if the CoreDNS version actually changes. Otherwise we have a strict upper limit on which CoreDNS versions KCP can manage, even when KCP doesn't have to migrate any CoreDNS configuration at all.
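
A minimal sketch of the proposed behaviour, assuming the migration step can be treated as a callback. This is not KCP's actual code (KCP is written in Go); `reconcile_coredns` and the `migrate` parameter are hypothetical stand-ins used only to show the guard.

```python
from typing import Callable


def reconcile_coredns(
    current: str,
    target: str,
    migrate: Callable[[str, str], None],
) -> bool:
    """Run the migration only when the CoreDNS version changes.

    Returns True if the migration callback was invoked.
    """
    # Compare without the optional leading "v" so "v1.8.6" equals "1.8.6".
    if current.lstrip("v") == target.lstrip("v"):
        # Same version: there is no Corefile to migrate, so the range
        # supported by the migration lib does not matter here.
        return False
    migrate(current, target)
    return True


calls = []
reconcile_coredns("v1.8.6", "v1.8.6", lambda a, b: calls.append((a, b)))
reconcile_coredns("v1.8.4", "v1.8.5", lambda a, b: calls.append((a, b)))
print(calls)  # [('v1.8.4', 'v1.8.5')]
```

With a guard like this, a v1.8.6 => v1.8.6 "upgrade" never reaches the migration tool, so it cannot fail on an unsupported version.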

@killianmuldoon
Contributor

Agreed - that should be fixed for sure.

@killianmuldoon
Contributor

I'll take a look at the KCP fix too if that's alright with you.

@fabriziopandini
Member Author

+1 to the change in KCP to make it possible to upgrade to the same version (I assume this is not only in the webhook, but also in the upgrade logic)

WRT the docs, let's document the Kubernetes version/default CoreDNS version mapping and the CAPI version/CoreDNS ranges in https://cluster-api.sigs.k8s.io/reference/versions.html#kubeadm-control-plane-provider-kubeadm-control-plane-controller

@sbueringer
Member

/retitle Failing tests for 0.4 and 1.0

@k8s-ci-robot k8s-ci-robot changed the title Failing tests for 0.4 Failing tests for 0.4 and 1.0 Jan 19, 2022
@sbueringer
Member

sbueringer commented Jan 20, 2022

Short update. CAPI v0.4 & v1.0 updates from 1.22=>1.23 are green again. 1.23=>latest will be fixed by improving KCP as proposed above.

@killianmuldoon
Contributor

/assign

I'm going to tackle the KCP part. Good to see yesterday's change worked!

@sbueringer
Member

/reopen
until we have merged the cherry-picks for v1.1, v1.0 and v0.4

@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Jan 27, 2022
@sbueringer
Member

/close

PR has been cherry-picked to all releases and testgrid looks good.

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.


4 participants