Controller ends up on fargate nodes, fails to start #672

Closed
parviste-fortum opened this issue Dec 23, 2020 · 2 comments · Fixed by #856
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@parviste-fortum

/kind bug

What happened?

If we first run a pod on a Fargate node and let it finish, it takes a while for the node to go away. If an EBS CSI controller pod is scheduled during this time, it ends up on the Fargate node and fails to start with the error `Pod not supported: SchedulerName is not fargate-scheduler`.

What you expected to happen?

Fargate nodes set taints to prevent non-Fargate pods from being scheduled on them. By default, the controller has tolerations that cause this taint to be ignored.

I would expect the controller to respect the taint set on the Fargate nodes, so that the controller pods are scheduled onto nodes where they can actually run.
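For reference, here is a minimal sketch of one way to keep the controller off Fargate nodes via node affinity. It assumes Fargate nodes carry the `eks.amazonaws.com/compute-type: fargate` label; the exact mechanism used by an eventual fix may differ:

```yaml
# Sketch: node affinity for the controller Deployment's pod template that
# excludes EKS Fargate nodes (assumes the eks.amazonaws.com/compute-type
# label is present on Fargate nodes).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate
```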

How to reproduce it (as minimally and precisely as possible)?

  1. Set up EKS with a Fargate profile matching e.g. the namespace `fargate`
  2. Run a pod in the `fargate` namespace (see the example manifest below)
  3. Run `kubectl get nodes` to observe that a Fargate node has been created
  4. Stop the pod
  5. Run `kubectl get nodes` to confirm that the node is still present
  6. Deploy the EBS CSI controller
  7. Observe that the controller pod fails to start due to the error mentioned above
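For step 2, a minimal pod that lands on Fargate because its namespace matches the Fargate profile could look like this (the image and names are illustrative only):

```yaml
# Illustrative pod for the reproduction above: scheduled onto Fargate
# because the fargate namespace matches the Fargate profile.
apiVersion: v1
kind: Pod
metadata:
  name: fargate-test
  namespace: fargate
spec:
  restartPolicy: Never
  containers:
    - name: sleep
      image: busybox
      command: ["sleep", "10"]
```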

Anything else we need to know?:

This seems related to #591, which is about the DaemonSet and tolerations in general. I think this case is slightly different: even if tolerating everything by default might(?) make sense, tolerating Fargate compute nodes definitely does not, because the controller cannot possibly run there. Thus, at the very least, tolerations should be set for everything except Fargate nodes.
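For context, the kind of blanket toleration being discussed looks like the following sketch (not necessarily the exact manifest shipped by the driver); it matches every taint, including the one Fargate nodes use to keep regular pods off:

```yaml
# A catch-all toleration: matches every taint, so the Fargate NoSchedule
# taint no longer keeps the controller pod away from Fargate nodes.
tolerations:
  - operator: Exists
```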

#526 allows tolerations to be configured in the Helm chart. While that is one way to fix the issue, it does not really make sense that the default configuration (and the only configuration if we deploy using kustomize?) is broken when Fargate nodes are running.
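With #526, the tolerations could presumably be narrowed through chart values, e.g. something like the sketch below. The value name and placement are assumptions and depend on the chart version:

```yaml
# Hypothetical values.yaml override: tolerate only specific taints instead
# of tolerating everything, so Fargate's taint is respected.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```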

Environment

  • Kubernetes version (use kubectl version): v1.17.12-eks-7684af
  • Driver version: 7278cef
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 23, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2021
@ayberk
Contributor

ayberk commented Mar 23, 2021

/remove-lifecycle stale

I know we're not running the daemonset on Fargate; I guess we missed the controller.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2021