
Karpenter doesn't create Nodes #1225

Closed
Izvi-digibank opened this issue Jan 26, 2022 · 16 comments

@Izvi-digibank

Installed Karpenter following the documentation. Created the following provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner-test
spec:
  requirements:
    - key: dwh
      operator: In
      values: ["yes-dwh"]
  taints:
    - key: dwh
      value: cronjobs-test
      effect: "NoSchedule"
  limits:
    resources:
      cpu: 1000
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-features
    subnetSelector: 
      kubernetes.io/cluster/features: '*'
    securityGroupSelector:
      Name: sg-***
  ttlSecondsAfterEmpty: 30

The resource I am trying to match with the provisioner above is:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: dwh-cron
spec:
  jobTemplate:
    spec:
      template:
        spec:
          tolerations:
            - key: "dwh"
              value: cronjobs-test
          nodeSelector:
            dwh: "yes-dwh"

The error I get:
2022-01-26T16:07:43.841Z DEBUG controller.selection Could not schedule pod, matched 0/1 provisioners, tried provisioner/provisioner-test: invalid nodeSelector "dwh", [yes-dwh] not in [] {"commit": "5047f3c", "pod": "dwh-dev/karpenter-test-4vgnb"}

Would appreciate your assistance on this issue. Thanks in advance.

@ellistarn
Contributor

Currently, only well-known labels are supported via requirements. For custom labels, use the explicit labels syntax:

labels:
  dwh: yes-dwh

I'm surprised our validation logic allowed this. @felix-zhe-huang can you take a look?
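With that change, the provisioner above would look like this (a sketch adapting the original spec; unchanged fields trimmed):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner-test
spec:
  # Custom labels are applied to every node this provisioner creates;
  # pods can then target them with a plain nodeSelector.
  labels:
    dwh: yes-dwh
  taints:
    - key: dwh
      value: cronjobs-test
      effect: "NoSchedule"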

@Izvi-digibank
Author

@ellistarn Thank you.
Looks like the node is being created but never comes alive:

2022-01-26T18:06:09.131Z	INFO	controller.provisioning	Batched 1 pods in 1.00072609s	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.138Z	DEBUG	controller.provisioning	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.140Z	DEBUG	controller.provisioning	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.217Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m1.small m1.medium m3.medium t3.micro t3a.micro c1.medium t3.small t3a.small c3.large c4.large c5d.large c5a.large t3.medium c6i.large t3a.medium c5ad.large c5.large c5n.large m3.large m1.large]	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.268Z	DEBUG	controller.provisioning	Discovered security groups: [sg-08162fc9a077d5ff8 sg-0b2181de7540db421]	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.268Z	DEBUG	controller.provisioning	Ignoring security group sg-0b2181de7540db421, only one group with tag kubernetes.io/cluster/features is allowed	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.271Z	DEBUG	controller.provisioning	Discovered kubernetes version 1.21	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.338Z	DEBUG	controller.provisioning	Discovered ami ami-0adc757be1e4e11a1 for query /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id  	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.338Z	DEBUG	controller.provisioning	Discovered caBundle, length 1025	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:09.460Z	DEBUG	controller.provisioning	Created launch template, Karpenter-features-3699898603812528085	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:11.322Z	INFO	controller.provisioning	Launched instance: i-0db1827fa220659b2, hostname: ip-172-31-4-22.eu-west-1.compute.internal, type: t3a.micro, zone: eu-west-1b, capacityType: on-demand	{"commit": "5047f3c", "provisioner": "provisioner-test"}
2022-01-26T18:06:11.349Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-172-31-4-22.eu-west-1.compute.internal	{"commit": "5047f3c", "provisioner": "provisioner-test"}
➜  kubectl get nodes
NAME                                          STATUS     ROLES    AGE     VERSION
ip-172-31-4-22.eu-west-1.compute.internal     NotReady   <none>   10m

[screenshots: EC2 console showing the launched instances]

New nodes keep trying to come up every ~6 minutes; the status of all of them is "unknown".

Can you please suggest?

@ellistarn
Contributor

Are you following the getting started guide? There are many reasons the node can't connect.

  • instance profile needs the right permissions
  • security groups need connectivity to the masters
  • iam role needs to be granted access.

Try logging into the node with

aws ssm start-session --target $(kubectl get node -l karpenter.sh/provisioner-name -ojson | jq -r ".items[0].spec.providerID" | cut -d \/ -f5)

and then reading the kubelet logs with

sudo journalctl -u kubelet

@alekc
Contributor

alekc commented Jan 26, 2022

@Izvi-digibank check your subnet selectors. I had a similar issue; it was caused by nodes binding to a public subnet instead of a private one.
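One way to check which subnets that selector actually matches (a sketch, assuming the cluster tag from the provisioner above and AWS CLI credentials for the right account):

aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/features,Values=*" \
  --query "Subnets[].{ID:SubnetId,CIDR:CidrBlock,PublicIPOnLaunch:MapPublicIpOnLaunch}" \
  --output table

Subnets that report PublicIPOnLaunch as true are typically public; if any appear, narrow the subnetSelector (or re-tag the subnets) so only private ones match.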

@Izvi-digibank
Author

Izvi-digibank commented Jan 27, 2022

@ellistarn I am attaching all the relevant configuration, all following the documentation. @alekc the subnet selector is 100% private. Would appreciate your further attention to the following details:

Instance profile has the right permissions (Policy is "AmazonSSMManagedInstanceCore"):

[screenshot: instance profile in the IAM console]

AmazonSSMManagedInstanceCore policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeAssociation",
                "ssm:GetDeployablePatchSnapshotForInstance",
                "ssm:GetDocument",
                "ssm:DescribeDocument",
                "ssm:GetManifest",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:ListAssociations",
                "ssm:ListInstanceAssociations",
                "ssm:PutInventory",
                "ssm:PutComplianceItems",
                "ssm:PutConfigurePackageResult",
                "ssm:UpdateAssociationStatus",
                "ssm:UpdateInstanceAssociationStatus",
                "ssm:UpdateInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2messages:AcknowledgeMessage",
                "ec2messages:DeleteMessage",
                "ec2messages:FailMessage",
                "ec2messages:GetEndpoint",
                "ec2messages:GetMessages",
                "ec2messages:SendReply"
            ],
            "Resource": "*"
        }
    ]
}

KarpenterNodeInstanceProfile-features Trust relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

KarpenterController IAM Role policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:RunInstances",
                "ec2:CreateTags",
                "iam:PassRole",
                "ec2:TerminateInstances",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ssm:GetParameter"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

KarpenterController Trust Relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::***:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/***"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "oidc.eks.eu-west-1.amazonaws.com/id/***": "system:serviceaccount:karpenter:karpenter"
        }
      }
    }
  ]
}

Karpenter 0.5.3 helm chart values file:


serviceAccount:
  # -- Create a service account for the application controller
  create: true
  # -- Service account name
  name: karpenter
  # -- Annotations to add to the service account (like the ARN of the IRSA role)
  annotations: {eks.amazonaws.com/role-arn: arn:aws:iam::***:role/KarpenterController}
    
controller:
  # -- Additional environment variables to run with
  ## - name: AWS_REGION
  ## - value: eu-west-1
  env: []
  # -- Node selectors to schedule to nodes with labels.
  nodeSelector: {}
  # -- Tolerations to schedule to nodes with taints.
  tolerations: []
  # -- Affinity rules for scheduling
  affinity: {}
  # -- Image to use for the Karpenter controller
  image: "public.ecr.aws/karpenter/controller:v0.5.3@sha256:ddd24d756cb324cf8f91f2274621646f83d6121ed6856312ca672a5f78c57174"
  # -- Cluster name
  clusterName: "features"
  # -- Cluster endpoint
  clusterEndpoint: "https://***.gr7.eu-west-1.eks.amazonaws.com"
  resources:
    requests:
      cpu: 1
      memory: 1Gi
    limits:
      cpu: 1
      memory: 1Gi
  replicas: 1
webhook:
  # -- List of environment items to add to the webhook
  env: []
  # -- Node selectors to schedule to nodes with labels.
  nodeSelector: {}
  # -- Tolerations to schedule to nodes with taints.
  tolerations: []
  # -- Affinity rules for scheduling
  affinity: {}
  # -- Image to use for the webhook
  image: "public.ecr.aws/karpenter/webhook:v0.5.3@sha256:19a1e1f2c8ec6ece1b170584dd6251d2e00f1676503a65d1433f45f46e330ddf"
  # -- Set to true if using custom CNI on EKS
  hostNetwork: true
  port: 8443
  resources:
    limits:
      cpu: 200m
      memory: 100Mi
    requests:
      cpu: 200m
      memory: 100Mi
  replicas: 1

Also, I'd expect to see logs while the node is coming up.
@ellistarn
aws ssm start-session --target $(kubectl get node -l karpenter.sh/provisioner-name -ojson | jq -r ".items[0].spec.providerID" | cut -d \/ -f5)

I get i-0c8ddaf2ca6b7427f, which is the instance.

@ellistarn added the bug and burning labels on Jan 27, 2022
@ellistarn
Contributor

Can you connect using aws ssm start-session --target i-0c8ddaf2ca6b7427f and then check the kubelet logs, mentioned above?

@Izvi-digibank
Author

Izvi-digibank commented Jan 27, 2022

➜  ~ aws ssm start-session --target i-09986a80810518f03 --profile dev

Starting session with SessionId: iz@digibank-0276ae4db2a9663f3
sh-4.2$
sh-4.2$
sh-4.2$ sudo journalctl -u kubelet
-- No entries --

@ellistarn No logs visible :(
Did you happen to view the details I added in the last comment? Do all my configurations look okay to you?

@ellistarn
Contributor

ellistarn commented Jan 27, 2022

Your instance profile needs the 4 policies:

      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKS_CNI_Policy"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore"

@Izvi-digibank
Author

Thanks, I added those policies. In my opinion this isn't clear enough in the documentation; I'd suggest an edit.

New nodes are still in an unknown state; however, I was able to get some kubelet logs:

Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: I0127 20:03:42.631981    3183 csi_plugin.go:1024] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.645985    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.747172    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.847984    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.945556    3183 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.948822    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:42 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:42.980458    3183 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.049754    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.150725    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.251498    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.352811    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.453649    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.553948    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: I0127 20:03:43.632164    3183 csi_plugin.go:1024] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.655409    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.756325    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.857267    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:43 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:43.958160    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.059166    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.160706    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.262109    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.363362    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.464479    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"
Jan 27 20:03:44 ip-172-31-9-245.eu-west-1.compute.internal kubelet[3183]: E0127 20:03:44.565171    3183 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-245.eu-west-1.compute.internal\" not found"

@ellistarn Any advice?

@alekc
Contributor

alekc commented Jan 27, 2022

@Izvi-digibank
from your logs: Failed to contact API server when waiting for CSINode publishing: Unauthorized

I do not see the following in your configuration:

set {
    name  = "aws.defaultInstanceProfile"
    value = aws_iam_instance_profile.karpenter.name
  }

(check the docs https://karpenter.sh/v0.5.6/getting-started-with-terraform/#install-karpenter-helm-chart)
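For an install driven by the helm CLI rather than Terraform, the equivalent would be something like (a sketch; assumes the chart repo is already added as karpenter, and reuses the instance profile name from the provisioner above):

helm upgrade --install karpenter karpenter/karpenter \
  --namespace karpenter \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-features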

@Izvi-digibank
Author

Izvi-digibank commented Jan 27, 2022

@alekc I'm using v0.5.3 https://karpenter.sh/v0.5.3/getting-started-with-terraform/
This parameter isn't mentioned in that version's documentation.


Edit: I upgraded to v0.5.6 and added the value of aws.defaultInstanceProfile to the helm chart.
Still getting the same results; nothing changed.

Do my trust relationships look okay, for both KarpenterController and KarpenterNodeInstanceProfile-features?

@Izvi-digibank
Author

Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.669754    3130 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu-west-1"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.669767    3130 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-west-1"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.671629    3130 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-31-9-51.eu-west-1.compute.internal" event="NodeHasSufficientMemory"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.672004    3130 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-31-9-51.eu-west-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.672287    3130 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-31-9-51.eu-west-1.compute.internal" event="NodeHasSufficientPID"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:56.672558    3130 kubelet_node_status.go:71] "Attempting to register node" node="ip-172-31-9-51.eu-west-1.compute.internal"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:56.696006    3130 kubelet_node_status.go:93] "Unable to register node with API server" err="Unauthorized" node="ip-172-31-9-51.eu-west-1.compute.internal"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:56.739358    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:56.839931    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:56.941040    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.041166    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.141854    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.242647    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.343159    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.444323    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: I0127 21:23:57.488624    3130 csi_plugin.go:1024] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.545135    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.645959    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.747037    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.848239    3130 kubelet.go:2294] "Error getting node" err="node \"ip-172-31-9-51.eu-west-1.compute.internal\" not found"
Jan 27 21:23:57 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:57.855915    3130 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

Got some new logs from the kubelet.

@ellistarn
Contributor

Jan 27 21:23:56 ip-172-31-9-51.eu-west-1.compute.internal kubelet[3130]: E0127 21:23:56.696006 3130 kubelet_node_status.go:93] "Unable to register node with API server" err="Unauthorized" node="ip-172-31-9-51.eu-west-1.compute.internal"

Your node can't communicate with the API Server.

Here's an example of my aws-auth configmap

k get configmaps -n kube-system aws-auth -oyaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::767520670908:role/KarpenterNodeRole-etarn
      username: system:node:{{EC2PrivateDNSName}}
  mapUsers: |
    []
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system

In the future, I highly recommend following or directly translating one of the guides.

@Izvi-digibank
Author

You're correct; apparently my aws-auth did not get updated. Closing this thread. Thanks.

@devopsjnr

@ellistarn @felix-zhe-huang I'm encountering the same error; could you please take a look? #1683

@kaiohenricunha

@Izvi-digibank check your subnet selectors. I had a similar issue; it was caused by nodes binding to a public subnet instead of a private one.

That was my problem too. Removing the discovery tag from public subnets and then deleting the stuck nodeclaim and instance rapidly resolved the issue.
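
For anyone hitting this on a recent version, the cleanup might look like (a sketch; names are placeholders, and the NodeClaim API only exists in newer Karpenter releases):

# Delete the stuck nodeclaim so Karpenter stops waiting on it...
kubectl delete nodeclaim <stuck-nodeclaim-name>
# ...then terminate the orphaned instance so a fresh node can be provisioned.
aws ec2 terminate-instances --instance-ids <instance-id>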
