
kubeadm init wait control plane phase does not check health status of all components #2907

Closed
tommas1988 opened this issue Jul 26, 2023 · 13 comments
Assignees
Labels
area/controlplane kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence.
Milestone

Comments

@tommas1988

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): kubeadm version: &version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:18:49Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    # kubectl version
    WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --
    output=yaml|json to get the full version.
    Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2",
    GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z",
    GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
    Kustomize Version: v5.0.1
    Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3",
    GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:40Z",
    GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): centos8
  • Kernel (e.g. uname -a):
  • Container runtime (CRI) (e.g. containerd, cri-o): docker
  • Container networking plugin (CNI) (e.g. Calico, Cilium):
  • Others:

What happened?

The wait-control-plane phase of the kubeadm init command does not check the health status of all control plane components, only the API server. So when the controller manager or scheduler does not start up properly, the init command will still pass this phase.

What you expected to happen?

Check the health status of all control plane components to make sure they have all started up properly.

How to reproduce it (as minimally and precisely as possible)?

Provide an undefined option as an extra arg for the controller manager or scheduler.

Anything else we need to know?

@neolit123 neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 26, 2023
@neolit123 neolit123 added this to the v1.29 milestone Jul 26, 2023
@neolit123
Member

neolit123 commented Jul 26, 2023

The wait-control-plane phase of the kubeadm init command does not check the health status of all control plane components, only the API server. So when the controller manager or scheduler does not start up properly, the init command will still pass this phase.

it has been like that by design since kubeadm's early days.
at some point phases were added and "wait-control-plane" just encapsulated waiting for the api-server.
(thus marking this as a feature / not a bug)

we need to make sure this is not changing user expectations - e.g. in strange cases where there is a custom scheduler with a delayed start (?).

@SataQiu @pacoxu @chendave
are you aware of use cases where kubelet / API server starting is all we care about to complete kubeadm init and the rest of the components may or may not start later?

@tommas1988
Author

tommas1988 commented Jul 27, 2023 via email

@chendave
Member

The wait-control-plane phase of the kubeadm init command does not check the health status of all control plane components, only the API server.

Kubeadm checks the health status of the kubelet as well.

are you aware of use cases where kubelet / API server starting is all we care about to complete kubeadm init and the rest of the components may or may not start later?

I haven't noticed that.
@tommas1988 Have you seen this in your env? If we're gonna support this, we need a real use case for that.

@pacoxu
Member

pacoxu commented Jul 27, 2023

As the component-status API is deprecated, would we just check the pod running status for a short while or print the control-plane status at the end?

@tommas1988
Author

Kubeadm checks the health status of the kubelet as well.

kubeadm does check the kubelet and the API server, but it does not include the controller manager and scheduler.

@tommas1988 Have you seen this in your env? If we're gonna support this, we need a real use case for that.

create a config file something like:

kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
controllerManager:
  extraArgs:
    "not-supported-options": "foo"

will reproduce this bug

@tommas1988
Author

As the component-status API is deprecated, would we just check the pod running status for a short while or print the control-plane status at the end?

I have created a PR for this, but it seems to violate import rules.

@tommas1988
Author

we need to make sure this is not changing user expectations - e.g. in strange cases where there is a custom scheduler with a delayed start (?).

How would one make a custom scheduler delay its startup?
For the PR, I just use the startup probe of each static pod to check the health status of all control plane components.
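For reference, the static pod manifests kubeadm generates carry a probe of roughly this shape, which is what the PR relies on. The excerpt below is illustrative, not copied from a real manifest; 10257 is the default secure port for kube-controller-manager, and the threshold and period values are assumptions:

```yaml
# Illustrative excerpt from a kube-controller-manager static pod
# manifest (kubeadm generates the real one).
startupProbe:
  httpGet:
    host: 127.0.0.1
    path: /healthz
    port: 10257
    scheme: HTTPS
  failureThreshold: 24
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 15
```

Once the probe succeeds, the kubelet marks the container started, so waiting on the static pods' status transitively waits on each component's healthz endpoint.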

@neolit123
Member

thank you for updating the PR to not use the pkg package, which is internal to k/k and which kubeadm must not use!
also sorry for the delayed review; i've added some comments.

it seems like a nice-to-have, as it "completes" the component check on startup.
so the questions here remain:

  • do we want this?
  • is this change going to break someone?

@SataQiu @pacoxu @chendave

also @tommas1988 the code you are adding is only for init, but it must also be added for join.

@chendave
Member

kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
controllerManager:
  extraArgs:
    "not-supported-options": "foo"

will reproduce this bug

Since a ClusterConfiguration like that can hit the situation where the controller manager is not up while kubeadm just says the cluster is ready, I am inclined to +1 fixing it as long as it is not breaking anything (I assume any action on a cluster which is not really up is not expected to happen).

@pacoxu
Member

pacoxu commented Aug 28, 2023

  • do we want this?
  • is this change going to break someone?

As mentioned earlier ("in strange cases where there is a custom scheduler with a delayed start"), I think there are some cases that do not want to check the scheduler status at the beginning. Of course, this is not a usual case: https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/ My only concern is that we want to make it more extensible.

How about just raising the pod-failing information and not failing the init process?

@neolit123 neolit123 modified the milestones: v1.29, v1.30 Nov 1, 2023
@neolit123
Member

this is tracked for 1.30.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2024
@neolit123 neolit123 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2024
@neolit123
Member

the feature was added here
kubernetes/kubernetes#123341

we have an e2e test that is green
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-wait-for-all-control-plane-components-latest
(first failure was because the e2e PR merged before the k/k PR)

docs will be added in
kubernetes/website#45156
