
Unmanaged nodes go into unready status when fargate profile is added #2290

Closed
datGnomeLife opened this issue Jun 4, 2020 · 4 comments
Labels
kind/feature New feature or request priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@datGnomeLife

What happened?
When trying to add Fargate to an existing 1.14 cluster with unmanaged nodes: if I create a cluster with unmanaged nodes, everything is OK. As soon as I add a Fargate profile, the unmanaged node goes into NotReady status. The Fargate node is healthy and Ready, and the pod running on Fargate is healthy and Ready.
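
For reference, a quick way to observe the symptom (node name is just an example taken from the kubelet logs below; this assumes kubectl is pointed at the affected cluster):

$ kubectl get nodes -o wide
$ kubectl describe node ip-10-170-6-20.ec2.internal

The unmanaged node flips from Ready to NotReady shortly after the Fargate profile is created, while the Fargate node stays Ready.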

What you expected to happen?
That the unmanaged node remains in Ready status.

How to reproduce it?

  1. Create a cluster with a single unmanaged node group using a config file
    eksctl create cluster -f cluster-config.yaml

  2. Add a Fargate profile targeting the default namespace to the config file, then run create nodegroup to add the fargatePodExecutionRoleARN to the cluster
    eksctl create nodegroup --config-file=cluster-config.yaml

  3. Add the Fargate profile
    eksctl create fargateprofile -f cluster-config.yaml

Anything else we need to know?
cluster-config.yaml with redacted information

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: vstae1npeaplk03
  region: us-east-1
  version: "1.14"

# this example specifies a given role ARN for the service
iam:
  fargatePodExecutionRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleksworker.svcrole" 
  serviceRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleks.svcrole"

vpc:
  id: "vpc-abcdefg1234567890"
  subnets:
    private:
      us-east-1a:
          id: "subnet-abcdefg1234567890"
      us-east-1b:
          id: "subnet-abcdefg1234567891"
      us-east-1c:
          id: "subnet-abcdefg1234567892"
  clusterEndpoints:
      privateAccess: true
      publicAccess: true

fargateProfiles:
  - name: fp-dev
    selectors:
      - namespace: default

nodeGroups:
  - name: test-unmanaged-ng-01-v1
    instanceType: t3.small
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    volumeSize: 80
    volumeType: gp2
    privateNetworking: true
    ssh:
      allow: true # will use ~/.ssh/id_rsa.pub as the default ssh key
      publicKeyName: 'my_key_pair'
    labels:
      nodegroup-label: ng-01 
    iam:
        instanceProfileARN: "arn:aws:iam::111111111111:instance-profile/datalake.NonProd.dleksworker.svcrole"
        instanceRoleARN: "arn:aws:iam::111111111111:role/datalake.NonProd.dleksworker.svcrole"
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/vstae1npeaplk03: "owned" 

Versions
Please paste in the output of these commands:

$ eksctl version
0.20.0

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}

Logs

[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-east-1
[✔]  using existing VPC (vpc-****************) and subnets (private:[subnet-**************** subnet-**************** subnet-****************] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" will use "ami-07e0ca5eb121d3ed8" [AmazonLinux2/1.14]
[ℹ]  using EC2 key pair "****************"
[ℹ]  using Kubernetes version 1.14
[ℹ]  creating EKS cluster "vstae1npeaplk03" in "us-east-1" region with un-managed nodes
[ℹ]  1 nodegroup (eap-nonprod-test-unmanaged-ng-01-v3) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=vstae1npeaplk03'
[ℹ]  CloudWatch logging will not be enabled for cluster "vstae1npeaplk03" in "us-east-1"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-east-1 --cluster=vstae1npeaplk03'
[ℹ]  Kubernetes API endpoint access will use provided values {publicAccess=true, privateAccess=true} for cluster "vstae1npeaplk03" in "us-east-1"
[ℹ]  2 sequential tasks: { create cluster control plane "vstae1npeaplk03", 2 parallel sub-tasks: { 2 sequential sub-tasks: { tag cluster, update cluster VPC endpoint access configuration }, create nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" } }
[ℹ]  building cluster stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  deploying stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  building nodegroup stack "eksctl-vstae1npeaplk03-nodegroup-eap-nonprod-test-unmanaged-ng-01-v3"
[ℹ]  deploying stack "eksctl-vstae1npeaplk03-nodegroup-eap-nonprod-test-unmanaged-ng-01-v3"
[✔]  tagged EKS cluster (****************)
[!]  retryable error (Throttling: Rate exceeded
	status code: 400, request id: efe3d23c-f147-479b-91ea-1dc4c325d26f) from cloudformation/DescribeStacks - will retry after delay of 954.939668ms
[ℹ]  waiting for the control plane availability...
[✔]  saved kubeconfig as "/Users/ds06/.kube/config"
[ℹ]  no tasks
[✔]  all EKS cluster resources for "vstae1npeaplk03" have been created
[ℹ]  adding identity "arn:aws:iam::****************:role/datalake.NonProd.dleksworker.svcrole" to auth ConfigMap
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" has 0 node(s)
[ℹ]  waiting for at least 1 node(s) to become ready in "eap-nonprod-test-unmanaged-ng-01-v3"
[ℹ]  nodegroup "eap-nonprod-test-unmanaged-ng-01-v3" has 1 node(s)
[ℹ]  node "ip-10-170-2-164.ec2.internal" is ready
[ℹ]  kubectl command should work with "/Users/ds06/.kube/config", try 'kubectl get nodes'
[✔]  EKS cluster "vstae1npeaplk03" in "us-east-1" region is ready

Updating the nodegroup to add the Fargate pod execution role to the cluster:

[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-east-1
[ℹ]  1 existing nodegroup(s) (eap-nonprod-test-unmanaged-ng-01-v3) will be excluded
[ℹ]  combined exclude rules: eap-nonprod-test-unmanaged-ng-01-v3
[ℹ]  1 nodegroup (eap-nonprod-test-unmanaged-ng-01-v3) was excluded (based on the include/exclude rules)
[ℹ]  2 sequential tasks: { fix cluster compatibility, no tasks }
[ℹ]  checking cluster stack for missing resources
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  re-building cluster stack "eksctl-vstae1npeaplk03-cluster"
[ℹ]  updating stack to add new resources [] and outputs [FargatePodExecutionRoleARN]
[ℹ]  no tasks
[✔]  created 0 nodegroup(s) in cluster "vstae1npeaplk03"
[✔]  created 0 managed nodegroup(s) in cluster "vstae1npeaplk03"
[ℹ]  checking security group configuration for all nodegroups
[ℹ]  all nodegroups have up-to-date configuration

Creating fargate profile:

[ℹ]  creating Fargate profile "fp-dev" on EKS cluster "vstae1npeaplk03"
[ℹ]  created Fargate profile "fp-dev" on EKS cluster "vstae1npeaplk03"

kubeconfig on the worker node that then goes into NotReady status:

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: MASTER_ENDPOINT
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet
  name: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: /usr/bin/aws-iam-authenticator
      args:
        - "token"
        - "-i"
        - "CLUSTER_NAME"
        - --region
        - "AWS_REGION"

aws-auth configmap

---
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::***************:role/datalake.NonProd.dleksworker.svcrole
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::**************:role/datalake.NonProd.dleksworker.svcrole
      username: system:node:{{SessionName}}
  mapUsers: |
    []
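
Note that both mapRoles entries above point at the same role ARN: the first (username system:node:{{EC2PrivateDNSName}}) is the usual unmanaged-node mapping, and the second (username system:node:{{SessionName}}, with system:node-proxier) appears to be the Fargate pod execution role mapping. The live mapping can be inspected with plain kubectl, in case that helps:

$ kubectl -n kube-system get configmap aws-auth -o yaml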

Kubelet logs after the node goes into NotReady status:

$ journalctl -u kubelet -n 100

Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.003664    3763 reflector.go:126] object-"amazon-cloudwatch"/"cloudwatch-agent-token-z7sxt": Failed to list *v1.Secret: secrets "cloudwatch-agent-token-z7sxt" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.199923    3763 reflector.go:126] object-"kube-system"/"aws-node-token-msh5s": Failed to list *v1.Secret: secrets "aws-node-token-msh5s" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.600753    3763 reflector.go:126] object-"kube-system"/"kube-proxy": Failed to list *v1.ConfigMap: configmaps "kube-proxy" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:31 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.799801    3763 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: configmaps "coredns" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:31.999994    3763 reflector.go:126] object-"amazon-cloudwatch"/"fluentd-config": Failed to list *v1.ConfigMap: configmaps "fluentd-config" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.199993    3763 reflector.go:126] object-"kube-system"/"coredns-token-mzr2x": Failed to list *v1.Secret: secrets "coredns-token-mzr2x" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.400048    3763 reflector.go:126] object-"amazon-cloudwatch"/"cwagentconfig": Failed to list *v1.ConfigMap: configmaps "cwagentconfig" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.600073    3763 reflector.go:126] object-"amazon-cloudwatch"/"fluentd-token-6r75w": Failed to list *v1.Secret: secrets "fluentd-token-6r75w" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "amazon-cloudwatch": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:32 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:32.801233    3763 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: configmaps "kube-proxy-config" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
Jun 03 18:41:33 ip-10-170-6-20.ec2.internal kubelet[3763]: W0603 18:41:33.004111    3763 status_manager.go:501] Failed to update status for pod "fluentd-cloudwatch-7ppvz_amazon-cloudwatch(42c7307e-a5c2-11ea-829d-025f0412bf9f)": failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"status\":\"True\",\"type\":\"Ready\"}]}}" for pod "amazon-cloudwatch"/"fluentd-cloudwatch-7ppvz": pods "fluentd-cloudwatch-7ppvz" is forbidden: node "i-01e7f6b3a2b46b5b3" can only update pod status for pods with spec.nodeName set to itself
Jun 03 18:41:33 ip-10-170-6-20.ec2.internal kubelet[3763]: E0603 18:41:33.200196    3763 reflector.go:126] object-"kube-system"/"kube-proxy-token-wl2v8": Failed to list *v1.Secret: secrets "kube-proxy-token-wl2v8" is forbidden: User "system:node:i-01e7f6b3a2b46b5b3" cannot list resource "secrets" in API group "" in the namespace "kube-system": no relationship found between node "i-01e7f6b3a2b46b5b3" and this object
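
The errors above show the kubelet being authorized as "system:node:i-01e7f6b3a2b46b5b3" (the EC2 instance ID) rather than "system:node:<private DNS name>", which looks consistent with the second mapRoles entry (username system:node:{{SessionName}}) now matching the worker role. If it helps to confirm, the identity the node actually presents can be checked from the worker node itself; since the credentials come from the instance profile, the assumed-role session name is the instance ID:

$ aws sts get-caller-identity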
@datGnomeLife datGnomeLife changed the title Unmanaged Nodes Go into unready status when fargate profile is added Unmanaged nodes go into unready status when fargate profile is added Jun 4, 2020
@martina-if martina-if added the priority/important-soon Ideally to be resolved in time for the next release label Jun 10, 2020
@datGnomeLife
Author

datGnomeLife commented Jun 22, 2020

I have identified that the issue was due to the fargatePodExecutionRoleARN being the same ARN as the instanceRoleARN. I have since created a separate role just for Fargate, and I can create a new cluster and everything works! The issue now is that when I try to update the fargatePodExecutionRoleARN on the existing cluster, it doesn't create the Fargate profile with the new role. Is there a way to force an update of the fargatePodExecutionRoleARN?
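
For anyone else hitting this, something along these lines should produce a suitable dedicated role (the role name here is just an example; the trust policy uses the eks-fargate-pods.amazonaws.com service principal and the AWS-managed AmazonEKSFargatePodExecutionRolePolicy):

$ aws iam create-role \
    --role-name eksFargatePodExecutionRole \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"eks-fargate-pods.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
$ aws iam attach-role-policy \
    --role-name eksFargatePodExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy

The new role's ARN then goes into iam.fargatePodExecutionRoleARN in the config file, while the nodegroup keeps its own instanceRoleARN.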

@datGnomeLife
Author

I was able to resolve this issue by manually updating the cluster CloudFormation stack. If I understand correctly, eksctl uses the stack output for the fargatePodExecutionRoleARN when creating a Fargate profile. It might be worth considering a feature on eksctl update cluster to include a flag like --skip-control-plane for when you just want to update the stack and not the control plane.
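
For reference, the stack outputs can be checked directly; the stack name follows the eksctl-<cluster>-cluster pattern from the logs above:

$ aws cloudformation describe-stacks \
    --region us-east-1 \
    --stack-name eksctl-vstae1npeaplk03-cluster \
    --query 'Stacks[0].Outputs'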

@prithviramesh

Seems related to: kubernetes-sigs/aws-iam-authenticator#271

@michaelbeaumont michaelbeaumont added kind/feature New feature or request priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases and removed kind/bug priority/important-soon Ideally to be resolved in time for the next release labels Jul 1, 2020
@martina-if
Contributor

martina-if commented Jul 7, 2020

It might be worth considering a feature on eksctl update cluster to include a flag like --skip-control-plane when you just want to update the stack and not the control plane

Hi @datGnomeLife, indeed that is a good use case and it's tracked here, so I will close this issue.
