
Unmanaged nodes go into unready status when fargate profile is added #2290

Closed
datGnomeLife opened this issue Jun 4, 2020 · 4 comments
Labels
kind/feature New feature or request priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@datGnomeLife

What happened?
When trying to add Fargate to an existing 1.14 cluster with unmanaged nodes: if I create a cluster with unmanaged nodes, everything is OK. As soon as I add a Fargate profile, the unmanaged node goes into NotReady status. The Fargate node is healthy and Ready, and the pod running on Fargate is healthy and Ready.
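
For reference, a quick way to observe the symptom (node name is just an example taken from the kubelet logs below; this assumes kubectl is pointed at the affected cluster):

$ kubectl get nodes -o wide
$ kubectl describe node ip-10-170-6-20.ec2.internal

The unmanaged node flips from Ready to NotReady shortly after the Fargate profile is created, while the Fargate node stays Ready.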

What you expected to happen?
That the unmanaged node remains in Ready status.

How to reproduce it?

  1. Create a cluster with a single unmanaged node group using a config file
    eksctl create cluster -f cluster-config.yaml

  2. Add a Fargate profile targeting the default namespace to the config file, then run create nodegroup to add the fargatePodExecutionRoleARN to the cluster
    eksctl create nodegroup --config-file=cluster-config.yaml

  3. Add the Fargate profile
    eksctl create fargateprofile -f cluster-config.yaml

Anything else we need to know?
cluster-config.yaml with redacted information

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: vstae1npeaplk03
  region: us-east-1
  version: "1.14"

# this example specifies a given role ARN for the service
iam:
  fargatePodExecutionRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleksworker.svcrole" 
  serviceRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleks.svcrole"

vpc:
  id: "vpc-abcdefg1234567890"
  subnets:
    private:
      us-east-1a:
          id: "subnet-abcdefg1234567890"
      us-east-1b:
          id: "subnet-abcdefg1234567891"
      us-east-1c:
          id: "subnet-abcdefg1234567892"
  clusterEndpoints:
      privateAccess: true
      publicAccess: true

fargateProfiles:
  - name: fp-dev
    selectors:
      - namespace: default

nodeGroups:
  - name: test-unmanaged-ng-01-v1
    instanceType: t3.small
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    volumeSize: 80
    volumeType: gp2
    privateNetworking: true
    ssh:
      allow: true # will use ~/.ssh/id_rsa.pub as the default ssh key
      publicKeyName: 'my_key_pair'
    labels:
      nodegroup-label: ng-01 
    iam:
        instanceProfileARN: "arn:aws:iam::111111111111:instance-profile/datalake.NonProd.dleksworker.svcrole"
        instanceRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleksworker.svcrole"
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/vstae1npeaplk03: "owned" 

Versions
Please paste in the output of these commands:

$ eksctl version
0.20.0

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}

Logs

[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-east-1
[✔]  using existing VPC (vpc-****************) and subnets (private:[subnet-**************** subnet-**************** subnet-****************] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" will use "ami-07e0ca5eb121d3ed8" [AmazonLinux2/1.14]
[ℹ]  using EC2 key pair "****************"
[ℹ]  using Kubernetes version 1.14
[ℹ]  creating EKS cluster "vstae1npeaplk03" in "us-east-1" region with un-managed nodes
[ℹ]  1 nodegroup (eap-nonprod-test-unmanaged-ng-01-v3) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=vstae1npeaplk03'
[ℹ]  CloudWatch logging will not be enabled for cluster "vstae1npeaplk03" in "us-east-1"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-east-1 --cluster=vstae1npeaplk03'
[ℹ]  Kubernetes API endpoint access will use provided values {publicAccess=true, privateAccess=true} for cluster "vstae1npeaplk03" in "us-east-1"
[ℹ]  2 sequential tasks: { create cluster control plane "vstae1npeaplk03", 2 parallel sub-tasks: { 2 sequential sub-tasks: { tag cluster, update cluster VPC endpoint access configuration }, create nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" } }
[ℹ]  building cluster stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  deploying stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  building nodegroup stack "eksctl-vstae1npeaplk03-nodegroup-eap-nonprod-test-unmanaged-ng-01-v3"
[ℹ]  deploying stack "eksctl-vstae1npeaplk03-nodegroup-eap-nonprod-test-unmanaged-ng-01-v3"
[✔]  tagged EKS cluster (****************)
[!]  retryable error (Throttling: Rate exceeded
	status code: 400, request id: efe3d23c-f147-479b-91ea-1dc4c325d26f) from cloudformation/DescribeStacks - will retry after delay of 954.939668ms
[ℹ]  waiting for the control plane availability...
[✔]  saved kubeconfig as "/Users/ds06/.kube/config"
[ℹ]  no tasks
[✔]  all EKS cluster resources for "vstae1npeaplk03" have been created
[ℹ]  adding identity "arn:aws:iam::****************:role/datalake.NonProd.dleksworker.svcrole" to auth ConfigMap
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" has 0 node(s)
[ℹ]  waiting for at least 1 node(s) to become ready in "eap-nonprod-test-unmanaged-ng-01-v3"
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" has 1 node(s)
[ℹ]  node "ip-10-170-2-164.ec2.internal" is ready
[ℹ]  kubectl command should work with "/Users/ds06/.kube/config", try 'kubectl get nodes'
[✔]  EKS cluster "vstae1npeaplk03" in "us-east-1" region is ready

Updating the nodegroup to add the Fargate pod execution role to the cluster:

[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-east-1
[ℹ]  1 existing nodegroup(s) (eap-nonprod-test-unmanaged-ng-01-v3) will be excluded
[ℹ]  combined exclude rules: eap-nonprod-test-unmanaged-ng-01-v3
[ℹ]  1 nodegroup (eap-nonprod-test-unmanaged-ng-01-v3) was excluded (based on the include/exclude rules)
[ℹ]  2 sequential tasks: { fix cluster compatibility, no tasks }
[ℹ]  checking cluster stack for missing resources
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  re-building cluster stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  updating stack to add new resources [] and outputs [FargatePodExecutionRoleARN]
[ℹ]  no tasks
[✔]  created 0 nodegroup(s) in cluster "vstae1npeaplk03"
[✔]  created 0 managed nodegroup(s) in cluster "vstae1npeaplk03"
[ℹ]  checking security group configuration for all nodegroups
[ℹ]  all nodegroups have up-to-date configuration

Creating fargate profile:

[ℹ]  creating Fargate profile "fp-dev" on EKS cluster "vstae1npeaplk03"
[ℹ]  created Fargate profile "fp-dev" on EKS cluster "vstae1npeaplk03"

kubeconfig on the worker node that then goes into NotReady status:

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: MASTER_ENDPOINT
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet
  name: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: /usr/bin/aws-iam-authenticator
      args:
        - "token"
        - "-i"
        - "CLUSTER_NAME"
        - --region
        - "AWS_REGION"

aws-auth configmap

---
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::***************:role/datalake.NonProd.dleksworker.svcrole
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::**************:role/datalake.NonProd.dleksworker.svcrole
      username: system:node:{{SessionName}}
  mapUsers: |
    []
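
Note that both mapRoles entries above point at the same role ARN: the first (username system:node:{{EC2PrivateDNSName}}) is the usual unmanaged-node mapping, and the second (username system:node:{{SessionName}}, with system:node-proxier) appears to be the Fargate pod execution role mapping. The live mapping can be inspected with plain kubectl, in case that helps:

$ kubectl -n kube-system get configmap aws-auth -o yaml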

Kubelet logs after the node goes into NotReady status:

$ journalctl -u kubelet -n 100

Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.003664    3763 reflector.go:126] object-"amazon-cloudwatch"/"cloudwatch-agent-token-z7sxt": Failed to list *v1.Secret: secrets "cloudwatch-agent-token-z7sxt" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.199923    3763 reflector.go:126] object-"kube-system"/"aws-node-token-msh5s": Failed to list *v1.Secret: secrets "aws-node-token-msh5s" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.600753    3763 reflector.go:126] object-"kube-system"/"kube-proxy": Failed to list *v1.ConfigMap: configmaps "kube-proxy" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.799801    3763 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: configmaps "coredns" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.999994    3763 reflector.go:126] object-"amazon-cloudwatch"/"fluentd-config": Failed to list *v1.ConfigMap: configmaps "fluentd-config" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.199993    3763 reflector.go:126] object-"kube-system"/"coredns-token-mzr2x": Failed to list *v1.Secret: secrets "coredns-token-mzr2x" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.400048    3763 reflector.go:126] object-"amazon-cloudwatch"/"cwagentconfig": Failed to list *v1.ConfigMap: configmaps "cwagentconfig" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.600073    3763 reflector.go:126] object-"amazon-cloudwatch"/"fluentd-token-6r75w": Failed to list *v1.Secret: secrets "fluentd-token-6r75w" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.801233    3763 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: configmaps "kube-proxy-config" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:33 ip-10-170-6-20.ec2.internal kubelet[3763]: W0603 18:41:33.004111    3763 status_manager.go:501] Failed to update status for pod "fluentd-cloudwatch-7ppvz_amazon-cloudwatch(42c7307e-a5c2-11ea-829d-025f0412bf9f)": failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"status\":\"True\",\"type\":\"Ready\"}]}}" for pod "amazon-cloudwatch"/"fluentd-cloudwatch-7ppvz": pods "fluentd-cloudwatch-7ppvz" is forbidden: node "i-01e7f6b3a2b46b5b3" can only update pod status for pods with spec.nodeName set to itself
Jun 03 18:41:33 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:33.200196    3763 reflector.go:126] object-"kube-system"/"kube-proxy-token-wl2v8": Failed to list *v1.Secret: secrets "kube-proxy-token-wl2v8" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
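
The errors above show the kubelet being authorized as "system:node:i-01e7f6b3a2b46b5b3" (the EC2 instance ID) rather than "system:node:<private DNS name>", which looks consistent with the second mapRoles entry (username system:node:{{SessionName}}) now matching the worker role. If it helps to confirm, the identity the node actually presents can be checked from the worker node itself; since the credentials come from the instance profile, the assumed-role session name is the instance ID:

$ aws sts get-caller-identity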
@datGnomeLife datGnomeLife changed the title Unmanaged Nodes Go into unready status when fargate profile is added Unmanaged nodes go into unready status when fargate profile is added Jun 4, 2020
@martina-if martina-if added the priority/important-soon Ideally to be resolved in time for the next release label Jun 10, 2020
@datGnomeLife
Author

datGnomeLife commented Jun 22, 2020

I have identified that the issue was due to the fargatePodExecutionRoleARN being the same ARN as the instanceRoleARN. I have since created a separate role just for Fargate, and I can create a new cluster and everything works! The issue now is that when I try to update the fargatePodExecutionRoleARN on the existing cluster, it doesn't create the Fargate profile with the new role. Is there a way to force an update of the fargatePodExecutionRoleARN?
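
For anyone else hitting this, something along these lines should produce a suitable dedicated role (the role name here is just an example; the trust policy uses the eks-fargate-pods.amazonaws.com service principal and the AWS-managed AmazonEKSFargatePodExecutionRolePolicy):

$ aws iam create-role \
    --role-name eksFargatePodExecutionRole \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"eks-fargate-pods.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
$ aws iam attach-role-policy \
    --role-name eksFargatePodExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy

The new role's ARN then goes into iam.fargatePodExecutionRoleARN in the config file, while the nodegroup keeps its own instanceRoleARN.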

@datGnomeLife
Author

I was able to resolve this issue by manually updating the cluster CloudFormation stack. If I understand correctly, eksctl uses the stack output for the fargatePodExecutionRoleARN when creating a Fargate profile. It might be worth considering a feature on eksctl update cluster to include a flag like --skip-control-plane for when you just want to update the stack and not the control plane.
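
For reference, the stack outputs can be checked directly; the stack name follows the eksctl-<cluster>-cluster pattern from the logs above:

$ aws cloudformation describe-stacks \
    --region us-east-1 \
    --stack-name eksctl-vstae1npeaplk03-cluster \
    --query 'Stacks[0].Outputs'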

@prithviramesh

Seems related to: kubernetes-sigs/aws-iam-authenticator#271

@michaelbeaumont michaelbeaumont added kind/feature New feature or request priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases and removed kind/bug priority/important-soon Ideally to be resolved in time for the next release labels Jul 1, 2020
@martina-if
Contributor

martina-if commented Jul 7, 2020

It might be worth considering a feature on eksctl update cluster to include a flag like --skip-control-plane when you just want to update the stack and not the control plane

Hi @datGnomeLife, indeed that is a good use case and it's tracked here, so I will close this issue.
