KCP resilience to machine disk space issues #3289

Closed
benmoss opened this issue Jul 6, 2020 · 4 comments
Labels: area/control-plane, kind/feature


benmoss commented Jul 6, 2020

What steps did you take and what happened:
I adapted this space quota tutorial to a KCP cluster and used a slightly modified version of this gist to fill up etcd. After a while, kube-apiserver crashed on one of the machines. I tried to delete the bad machine but then ran into #2331, since the pods could not be drained.

I'm not really sure what went wrong. We could probably develop a faster method of filling up etcd, and maybe have some kind of simulation test for this that we can run periodically.
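
To get a feel for how much faster a fill loop could be, here is a rough back-of-the-envelope sketch. It assumes etcd's default backend quota of 2 GiB and default request size limit of 1.5 MiB; the function name and the `used_bytes` parameter are made up for illustration, and the estimate ignores key, revision, and index overhead, so the real number of writes will be somewhat lower.

```python
# Rough estimate only: ignores per-key and per-revision overhead in etcd's backend.
DEFAULT_QUOTA_BYTES = 2 * 1024**3        # etcd default --quota-backend-bytes (2 GiB)
DEFAULT_MAX_REQUEST_BYTES = 1536 * 1024  # etcd default --max-request-bytes (1.5 MiB)


def writes_to_fill(quota_bytes: int, value_bytes: int, used_bytes: int = 0) -> int:
    """Return roughly how many puts of `value_bytes` are needed to reach the quota.

    `used_bytes` is a hypothetical parameter for the space already consumed.
    """
    if value_bytes <= 0:
        raise ValueError("value_bytes must be positive")
    remaining = max(quota_bytes - used_bytes, 0)
    # Integer ceiling division, so a partial final write still counts as one put.
    return -(-remaining // value_bytes)


if __name__ == "__main__":
    for size in (1024, 1024**2):
        count = writes_to_fill(DEFAULT_QUOTA_BYTES, size)
        print(f"{size} byte values: about {count} puts")
```

Under these assumptions, 1 MiB values need about a thousand times fewer puts than 1 KiB values, which is probably the cheapest way to speed up a periodic simulation test.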

What did you expect to happen:
etcd should stop accepting writes. I'm not sure how k8s is intended to behave when etcd can't accept new writes.

Anything else you would like to add:
This came up as part of #3185, where we were looking at space quotas and alarms. I didn't see any alarms raised before the apiserver crashed.
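
For reference, a check for the alarm could look something like the sketch below. It parses the text that `etcdctl alarm list` prints, assuming the `memberID:<id> alarm:<TYPE>` line format shown in the etcd space quota documentation; the function names are hypothetical and the output format may differ between etcd versions.

```python
from typing import List, Tuple


def parse_alarms(output: str) -> List[Tuple[str, str]]:
    """Parse `etcdctl alarm list` text output into (member_id, alarm_type) pairs.

    Assumes lines of the form "memberID:<id> alarm:<TYPE>"; other lines are skipped.
    """
    alarms = []
    for line in output.splitlines():
        # Split each whitespace-separated "key:value" token into a dict entry.
        fields = dict(part.split(":", 1) for part in line.split() if ":" in part)
        if "memberID" in fields and "alarm" in fields:
            alarms.append((fields["memberID"], fields["alarm"]))
    return alarms


def has_nospace(output: str) -> bool:
    """Return True if any member reports a NOSPACE alarm."""
    return any(kind == "NOSPACE" for _, kind in parse_alarms(output))


if __name__ == "__main__":
    sample = "memberID:8630161756594109333 alarm:NOSPACE\n"
    print(parse_alarms(sample), has_nospace(sample))
```

Polling something like this during the fill loop would show whether the alarm is ever raised before the apiserver goes down.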

Environment:

  • Cluster-api version: e1ed12c
  • Minikube/KIND version: kind 0.8.1
  • Kubernetes version: (use kubectl version): 1.18.2

/kind feature
/area control-plane

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 6, 2020
Member

vincepri commented Jul 8, 2020

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Jul 8, 2020
Author

benmoss commented Jul 8, 2020

Related issue: kubernetes/kubeadm#2195

Member

vincepri commented Aug 3, 2020

New conditions introduced in #3138 might solve part of this issue by providing more information on the underlying node. Fixing #2331 might let the control plane come up correctly again. Other than that, there is no action item here for Cluster API. @ncdc pointed out that most of the checks might happen in a separate monitoring environment; operators will have to either integrate with CAPI (spec.upgradeAfter) or free up some space.
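
As a rough illustration of what such an external check might look like, here is a minimal sketch. It measures free disk space with `shutil.disk_usage` and, below a threshold, builds a merge-patch body that sets `spec.upgradeAfter` on the KubeadmControlPlane. The 10% threshold, the function names, and the idea of applying the result as a merge patch are assumptions for illustration, not something Cluster API prescribes.

```python
import shutil
from datetime import datetime, timezone


def free_ratio(path="/", usage=shutil.disk_usage):
    """Return the fraction of free space on the filesystem containing `path`."""
    total, _used, free = usage(path)
    return free / total if total else 0.0


def upgrade_after_patch(now=None):
    """Build a merge-patch body that sets spec.upgradeAfter to an RFC 3339 UTC time."""
    now = now or datetime.now(timezone.utc)
    return {"spec": {"upgradeAfter": now.strftime("%Y-%m-%dT%H:%M:%SZ")}}


def plan_rollout(path="/", min_free_ratio=0.10, usage=shutil.disk_usage, now=None):
    """Return a patch body if free space is below the (assumed) threshold, else None."""
    if free_ratio(path, usage) >= min_free_ratio:
        return None
    return upgrade_after_patch(now)


if __name__ == "__main__":
    print(plan_rollout("/"))
```

The `usage` and `now` parameters exist only so the check can be exercised without a full disk. Whether replacing machines is the right response, rather than freeing up space, is left to the operator.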

/close

Contributor

@vincepri: Closing this issue.

