KCP resilience to machine disk space issues #3289

benmoss · 2020-07-06T15:27:03Z

What steps did you take and what happened:
I adapted this space quota tutorial to a KCP cluster, using a slightly modified version of this gist to fill up etcd. After a while I saw kube-apiserver crash on one of the machines. I tried to delete the bad machine but then ran into #2331, since the pods could not be drained.

I'm not really sure what went wrong. We could probably develop a faster method of filling up etcd, and maybe have some kind of simulation test of this that we can run periodically.

What did you expect to happen:
etcd should stop accepting writes. I'm not sure how k8s is intended to behave when etcd can't accept new writes.

Anything else you would like to add:
This came up as part of #3185, where we were looking at space quotas and the alarms. I didn't see any alarms getting raised before the apiserver crashed.
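A minimal sketch of how one might watch for etcd approaching its space quota before anything crashes. It is not part of the original report. It assumes the JSON shape printed by `etcdctl endpoint status -w json` (a list of objects with `Endpoint` and `Status.dbSize`) and etcd's default 2 GiB backend quota; the function names and the 0.8 threshold are hypothetical.

```python
import json

# Assumption: etcd's default backend quota (2 GiB). Pass a different value
# if --quota-backend-bytes is set on the cluster.
DEFAULT_QUOTA_BYTES = 2 * 1024 ** 3


def quota_usage(status_json, quota_bytes=DEFAULT_QUOTA_BYTES):
    """Return {endpoint: fraction of quota used}.

    Assumes input shaped like `etcdctl endpoint status -w json` output:
    a list of objects carrying "Endpoint" and "Status"/"dbSize".
    """
    usage = {}
    for entry in json.loads(status_json):
        usage[entry["Endpoint"]] = entry["Status"]["dbSize"] / quota_bytes
    return usage


def endpoints_near_quota(status_json, threshold=0.8,
                         quota_bytes=DEFAULT_QUOTA_BYTES):
    """List endpoints whose database size is at or above threshold * quota."""
    usage = quota_usage(status_json, quota_bytes)
    return sorted(ep for ep, frac in usage.items() if frac >= threshold)
```

The status JSON would come from running `etcdctl` against each control plane member. This only compares database size to the quota; it does not detect the alarm itself or a full machine disk.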

Environment:

Cluster-api version: e1ed12c
Minikube/KIND version: kind 0.8.1
Kubernetes version: (use kubectl version): 1.18.2

/kind feature
/area control-plane


vincepri · 2020-07-08T17:36:20Z

/milestone v0.3.x

benmoss · 2020-07-08T17:36:30Z

Related issue: kubernetes/kubeadm#2195

vincepri · 2020-08-03T18:28:35Z

New conditions introduced in #3138 might solve part of this issue by providing more information on the underlying node. Fixing #2331 might let the control plane come up correctly again. Other than that, there is no action item here for Cluster API. @ncdc pointed out that most of the checks might happen in a separate monitoring environment; operators will have to either integrate with CAPI (spec.upgradeAfter) or free up some space.

/close

k8s-ci-robot · 2020-08-03T18:28:42Z

@vincepri: Closing this issue.


k8s-ci-robot added this to the v0.3.x milestone Jul 8, 2020

k8s-ci-robot closed this as completed Aug 3, 2020
