Controller ends up on fargate nodes, fails to start #672

Closed
parviste-fortum opened this issue Dec 23, 2020 · 2 comments · Fixed by #856
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@parviste-fortum

/kind bug

What happened?

If we first run a pod on a Fargate node and let it finish, it takes a while for the node to go away. If an EBS CSI controller pod is scheduled during this time, it ends up on the Fargate node and fails to start with the error `Pod not supported: SchedulerName is not fargate-scheduler`.

What you expected to happen?

Fargate nodes set taints to prevent non-Fargate pods from being scheduled on them. By default, the controller has tolerations that cause this taint to be ignored.

I would expect the controller to respect the taint set on the Fargate nodes, so that the controller pods are scheduled onto nodes where they can actually run.
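For reference, here is a minimal sketch of one way to keep the controller off Fargate nodes via node affinity. It assumes Fargate nodes carry the `eks.amazonaws.com/compute-type: fargate` label; the exact mechanism used by an eventual fix may differ:

```yaml
# Sketch: node affinity for the controller Deployment's pod template that
# excludes EKS Fargate nodes (assumes the eks.amazonaws.com/compute-type
# label is present on Fargate nodes).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate
```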

How to reproduce it (as minimally and precisely as possible)?

  1. Set up EKS with a Fargate profile matching e.g. the namespace `fargate`
  2. Run a pod in the `fargate` namespace (see the example manifest below)
  3. Run `kubectl get nodes` to observe that a Fargate node has been created
  4. Stop the pod
  5. Run `kubectl get nodes` to confirm that the node is still present
  6. Deploy the EBS CSI controller
  7. Observe that the controller pod fails to start due to the error mentioned above
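For step 2, a minimal pod that lands on Fargate because its namespace matches the Fargate profile could look like this (the image and names are illustrative only):

```yaml
# Illustrative pod for the reproduction above: scheduled onto Fargate
# because the fargate namespace matches the Fargate profile.
apiVersion: v1
kind: Pod
metadata:
  name: fargate-test
  namespace: fargate
spec:
  restartPolicy: Never
  containers:
    - name: sleep
      image: busybox
      command: ["sleep", "10"]
```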

Anything else we need to know?:

This seems related to #591, which is about the DaemonSet and tolerations in general. I think this case is slightly different: even if tolerating everything by default might(?) make sense, tolerating Fargate compute nodes definitely does not, because the controller cannot possibly run there. Thus, at the very least, tolerations should be set for everything except Fargate nodes.
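For context, the kind of blanket toleration being discussed looks like the following sketch (not necessarily the exact manifest shipped by the driver); it matches every taint, including the one Fargate nodes use to keep regular pods off:

```yaml
# A catch-all toleration: matches every taint, so the Fargate NoSchedule
# taint no longer keeps the controller pod away from Fargate nodes.
tolerations:
  - operator: Exists
```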

#526 allows tolerations to be configured in the Helm chart. While that is one way to fix the issue, it does not really make sense that the default configuration (and the only configuration if we deploy using kustomize?) is broken when Fargate nodes are running.
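With #526, the tolerations could presumably be narrowed through chart values, e.g. something like the sketch below. The value name and placement are assumptions and depend on the chart version:

```yaml
# Hypothetical values.yaml override: tolerate only specific taints instead
# of tolerating everything, so Fargate's taint is respected.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```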

Environment

  • Kubernetes version (use kubectl version): v1.17.12-eks-7684af
  • Driver version: 7278cef
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 23, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2021
@ayberk
Contributor

ayberk commented Mar 23, 2021

/remove-lifecycle stale

I know we're not running the daemonset on Fargate; I guess we missed the controller.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2021