Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics server as addon to k8s CI #1903

Merged

Conversation

jsturtevant
Copy link
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:
We would like to run HPA tests in upstream and the metrics server is required for HPA components: #1881

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1125

Special notes for your reviewer:
I've only added this to the CI Job that is need for now but did tried to do it in a way that would make it pretty easy to add to other jobs and tilt if we wanted.

Also using kustomize to pull in the metric server from the releases page on https://github.com/kubernetes-sigs/metrics-server/releases. This allows for easy updates and overriding settings if needed via kustomize. This could be a pattern we use for the other addons as well.

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Add metrics server to CI jobs for upstream testing

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 2, 2021
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2021
@jsturtevant jsturtevant force-pushed the add-metrics-server-to-ci branch 4 times, most recently from ab164e6 to 86c615a Compare December 3, 2021 00:25
@jsturtevant
Copy link
Contributor Author

/hold

I've got it deploying but running into errors:

I1203 00:40:24.651567       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
E1203 00:40:29.522716       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.1.0.5:10250/stats/summary?only_cpu_and_memory=true\": x509: cannot validate certificate for 10.1.0.5 because it doesn't contain any IP SANs" node="capz-conf-bvvs4"
E1203 00:40:29.526596       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.1.0.4:10250/stats/summary?only_cpu_and_memory=true\": x509: cannot validate certificate for 10.1.0.4 because it doesn't contain any IP SANs" node="capz-conf-gtfr6"
E1203 00:40:29.527545       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.0.0.4:10250/stats/summary?only_cpu_and_memory=true\": x509: cannot validate certificate for 10.0.0.4 because it doesn't contain any IP SANs" node="capz-conf-5ghwmu-control-plane-9rlx5"

Apparently kubeadm doesn't set this up properly out of the box: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm/#cannot-use-the-metrics-server-securely-in-a-kubeadm-cluster

It looks like we are missing the ability to properly get signed certs:
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#kubelet-serving-certs

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2021
@jsturtevant
Copy link
Contributor Author

@jsturtevant
Copy link
Contributor Author

After following the directions in https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#kubelet-serving-certs (note that ServerTLSBootstrap setting in kubelet config is actually the flag rotate-server-certificates in kubelet command line parameters

setting the field rotate-server-certificates on the kubelet join configurations for all the nodes did work but required manual intervention to approve the CSR before metric server would work. The cluster and the deployment still came up but with the same error as above before the manual approval.

As there is a proposal to address this long term in CAPI I will plan to turn on the metric server flag --kubelet-insecure-tls for the e2e tests with the plan to enable this customers longer term with the CAPI solution. It is possible to write a controller that would auto enable the CSR signing but would be not relevant once the CAPI work is in place.

jsturtevant added a commit to jsturtevant/cluster-api-provider-azure that referenced this pull request Dec 9, 2021
Due to the way that kubeadm configures the kubelet with a self-signed
certificate, the metric server can't verify the kubelet certs.  There is
a proposal in CAPI to manage this long term.  For the e2e tests we
disabling the tls verificiation.  See
kubernetes-sigs#1903 (comment)
for more details.
@jsturtevant jsturtevant force-pushed the add-metrics-server-to-ci branch from 86c615a to e8385ba Compare December 9, 2021 19:18
Due to the way that kubeadm configures the kubelet with a self-signed
certificate, the metric server can't verify the kubelet certs.  There is
a proposal in CAPI to manage this long term.  For the e2e tests we
disabling the tls verificiation.  See
kubernetes-sigs#1903 (comment)
for more details.
@jsturtevant jsturtevant force-pushed the add-metrics-server-to-ci branch from e8385ba to 0093c77 Compare December 9, 2021 19:25
jsturtevant added a commit to jsturtevant/cluster-api-provider-azure that referenced this pull request Dec 9, 2021
Due to the way that kubeadm configures the kubelet with a self-signed
certificate, the metric server can't verify the kubelet certs.  There is
a proposal in CAPI to manage this long term.  For the e2e tests we
disabling the tls verificiation.  See
kubernetes-sigs#1903 (comment)
for more details.
@jsturtevant
Copy link
Contributor Author

/test ls

@k8s-ci-robot
Copy link
Contributor

@jsturtevant: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-windows-dockershim
  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-apiversion-upgrade
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-ci-entrypoint
  • /test pull-cluster-api-provider-azure-conformance
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-coverage
  • /test pull-cluster-api-provider-azure-e2e-exp
  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade-1-21-1-22
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade-1-22-latest-main
  • /test pull-cluster-api-provider-azure-upstream-windows-dockershim
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts-serial-slow

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-ci-entrypoint
  • pull-cluster-api-provider-azure-coverage
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-exp
  • pull-cluster-api-provider-azure-e2e-windows-dockershim
  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-verify

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsturtevant
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts

@jsturtevant
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

@jsturtevant
Copy link
Contributor Author

The windows tests ran but have the flakes I am investigating in s separate workstream.

/test pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts

@jsturtevant
Copy link
Contributor Author

/assign @CecileRobertMichon @devigned @mboersma

kind: Kustomization
namespace: kube-system
resources:
- https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.2/components.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome, we should use this model for more stuff (eg. cloud-provider) instead of maintaining addon specs in the repo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! I created a tracking issue with a few possible items: #1917

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2021
Copy link
Contributor

@devigned devigned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

+1 to how you are pulling the metrics server resources

@CecileRobertMichon
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2021
@jsturtevant
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 2b4a27e into kubernetes-sigs:main Dec 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.1 milestone Dec 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Metrics server to clusters
5 participants