Lease: Failed to get lease: leases.coordination.k8s.io #1634

Closed
pratikbin opened this issue Apr 6, 2022 · 20 comments
Labels
bug Something isn't working

Comments

@pratikbin

Version

Karpenter: v0.8.0

Kubernetes: v1.21.5-eks-bc4871b

Expected Behavior

Actual Behavior

Getting Lease: Failed to get lease: leases.coordination.k8s.io "ip-xx-xx-xx-xx.ap-south-1.compute.internal" not found

Steps to Reproduce the Problem

I took these Terraform steps from the docs and fixed a few deprecated module inputs.

locals {
  cluster_name = "eks-test"
}

## PHASE 1
## EKS
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = local.cluster_name
  cidr = "10.0.0.0/16"

  azs             = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "owned"
    "karpenter.sh/discovery"                      = local.cluster_name
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.18.0"


  cluster_version = "1.21"
  cluster_name    = local.cluster_name
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
  enable_irsa     = true

  eks_managed_node_groups = {
    sentries = {
      min_size     = 1
      max_size     = 1
      desired_size = 1

      instance_types = ["c5.xlarge"]
      capacity_type  = "ON_DEMAND"
    }
  }

  tags = {
    "karpenter.sh/discovery" = local.cluster_name
  }
}

## PHASE 2
## Karpenter
resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
  depends_on       = [module.eks]
}

data "aws_iam_policy" "ssm_managed_instance" {
  arn        = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  depends_on = [aws_iam_service_linked_role.spot]
}

resource "aws_iam_role_policy_attachment" "karpenter_ssm_policy" {
  role       = module.eks.cluster_iam_role_name
  policy_arn = data.aws_iam_policy.ssm_managed_instance.arn
  depends_on = [aws_iam_service_linked_role.spot]
}

resource "aws_iam_instance_profile" "karpenter" {
  name       = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role       = module.eks.cluster_iam_role_name
  depends_on = [aws_iam_service_linked_role.spot]
}
module "iam_assumable_role_karpenter" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "4.7.0"
  create_role                   = true
  role_name                     = "karpenter-controller-${local.cluster_name}"
  provider_url                  = module.eks.cluster_oidc_issuer_url
  oidc_fully_qualified_subjects = ["system:serviceaccount:karpenter:karpenter"]

  depends_on = [
    aws_iam_role_policy_attachment.karpenter_ssm_policy,
    aws_iam_instance_profile.karpenter
  ]
}

resource "aws_iam_role_policy" "karpenter_controller" {
  name = "karpenter-policy-${local.cluster_name}"
  role = module.iam_assumable_role_karpenter.iam_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:CreateLaunchTemplate",
          "ec2:CreateFleet",
          "ec2:RunInstances",
          "ec2:CreateTags",
          "iam:PassRole",
          "ec2:TerminateInstances",
          "ec2:DescribeLaunchTemplates",
          "ec2:DeleteLaunchTemplate",
          "ec2:DescribeInstances",
          "ec2:DescribeSecurityGroups",
          "ec2:DescribeSubnets",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeInstanceTypeOfferings",
          "ec2:DescribeAvailabilityZones",
          "ssm:GetParameter"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })

  depends_on = [
    aws_iam_role_policy_attachment.karpenter_ssm_policy,
    aws_iam_instance_profile.karpenter
  ]
}

resource "local_file" "basic" {
  filename = "basic"
  content  = <<EOF
helm repo add karpenter https://charts.karpenter.sh
helm upgrade --install karpenter karpenter/karpenter \
  --version 0.8.0 \
  --create-namespace \
  --namespace karpenter \
  --set serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn='${module.iam_assumable_role_karpenter.iam_role_arn}' \
  --set clusterName='${local.cluster_name}' \
  --set clusterEndpoint='${module.eks.cluster_endpoint}' \
  --set aws.defaultInstanceProfile='${aws_iam_instance_profile.karpenter.name}'

## karpenter provisioner
cat <<EOT | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: mumbai
spec:
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.2xlarge", "t2.2xlarge", "t3a.2xlarge", "c5.4xlarge"]
  provider:
    subnetSelector:
      karpenter.sh/discovery: eks-test
    securityGroupSelector:
      karpenter.sh/discovery: eks-test
  ttlSecondsAfterEmpty: 10
EOT

## Test it with demo Deployment and scale it
cat <<EOT | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
EOT

kubectl scale deployment inflate --replicas 20
EOF
}

This will create a file named basic with the necessary Helm, Provisioner, etc. fields filled in.


Resource Specs and Logs

webhook logs
https://gist.github.com/pratikbin/5e2f1c54032c6a8c43d4e60e1648c481

controller logs
https://gist.github.com/pratikbin/3db319cd9195818f6c814ce8c55644fe

@pratikbin pratikbin added the bug Something isn't working label Apr 6, 2022
@bwagner5
Contributor

bwagner5 commented Apr 6, 2022

Do you have an entry like this in your aws-auth config map?

kubectl get configmap aws-auth -n kube-system -o yaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::1234567890:role/eksctl-karpenter-demo-nodegroup-k-NodeInstanceRole-YBGH50RFGIEL
      username: system:node:{{EC2PrivateDNSName}}

@pratikbin
Author

Yes

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxx:role/sentries-eks-node-group-xxxx
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system

@pratikbin
Author

pratikbin commented Apr 7, 2022

FYI, I was able to run Karpenter successfully with the eksctl method from the Karpenter docs; that was working perfectly.

@bwagner5
Contributor

bwagner5 commented Apr 7, 2022

There are a few Terraform related PRs out that may be worth checking out, so I'll link them here:
#1551
#1332

I'm going to look into the current terraform getting started guide today and see if I can get it working.

@bwagner5
Contributor

bwagner5 commented Apr 7, 2022

EDIT: nvm I see you upgraded to >18 and that doesn't exist :( let me track this down more. I still think it's something to do with that role.

actually I think this:

resource "aws_iam_instance_profile" "karpenter" {
  name       = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role       = module.eks.cluster_iam_role_name
  depends_on = [aws_iam_service_linked_role.spot]
}

should be:

resource "aws_iam_instance_profile" "karpenter" {
  name       = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role       = module.eks.worker_iam_role_name
  depends_on = [aws_iam_service_linked_role.spot]
}

Specifically this line should be worker_iam_role_name rather than cluster_iam_role_name:
role = module.eks.worker_iam_role_name

https://karpenter.sh/preview/getting-started/getting-started-with-terraform/#:~:text=role%20%3D%20module.eks.worker_iam_role_name

@bwagner5
Contributor

bwagner5 commented Apr 7, 2022

These are the managed policies that the Node needs. Can you try adding these to an IAM Role you create in Terraform and passing that new role's instance profile to Karpenter?

arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
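
For reference, a minimal Terraform sketch of what that could look like, assuming a dedicated node role; the role name, trust policy, and resource names here are illustrative and not taken from the getting-started guide:

# Hypothetical dedicated node role for instances launched by Karpenter.
resource "aws_iam_role" "karpenter_node" {
  name = "karpenter-node-${local.cluster_name}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Attach the managed policies listed above.
resource "aws_iam_role_policy_attachment" "karpenter_node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
  ])

  role       = aws_iam_role.karpenter_node.name
  policy_arn = each.value
}

# Instance profile passed to the chart via aws.defaultInstanceProfile,
# replacing the one earlier in this issue that pointed at the cluster role.
resource "aws_iam_instance_profile" "karpenter" {
  name = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role = aws_iam_role.karpenter_node.name
}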

@bryantbiggs
Member

FYI - working example that you can refer to @pratikbin https://github.com/clowdhaus/eks-reference-architecture/tree/main/karpenter

@pratikbin
Author

@bryantbiggs thanks I'll try it

@pratikbin
Author

EDIT: nvm I see you upgraded to >18 and that doesn't exist :( let me track this down more. I still think it's something to do with that role.

actually I think this:

resource "aws_iam_instance_profile" "karpenter" {
  name       = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role       = module.eks.cluster_iam_role_name
  depends_on = [aws_iam_service_linked_role.spot]
}

should be:

resource "aws_iam_instance_profile" "karpenter" {
  name       = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role       = module.eks.worker_iam_role_name
  depends_on = [aws_iam_service_linked_role.spot]
}

Specifically this line should be worker_iam_role_name rather than cluster_iam_role_name: role = module.eks.worker_iam_role_name

https://karpenter.sh/preview/getting-started/getting-started-with-terraform/#:~:text=role%20%3D%20module.eks.worker_iam_role_name

There is no worker_iam_role_name output in the v18 EKS module

@bryantbiggs
Member

Karpenter requires at least one node to get started (something to host the pods that control scaling, as well as to run CoreDNS, the VPC CNI, etc.). Here is an example of deploying just an EKS managed node group with a single node https://github.com/terraform-aws-modules/terraform-aws-eks/blob/3ff17205a4ead51cca993547ef3de42cc080043b/examples/karpenter/main.tf#L45-L56

So with that node group, an IAM role is created by the module and we can reference that to create an instance profile for Karpenter https://github.com/terraform-aws-modules/terraform-aws-eks/blob/3ff17205a4ead51cca993547ef3de42cc080043b/examples/karpenter/main.tf#L119-L122
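
Applied to the config in this issue, that second step might look roughly like this; the node group key sentries is taken from the snippet above, and the eks_managed_node_groups output attribute is assumed to match the v18 module:

# Point the Karpenter instance profile at the IAM role the module created
# for the "sentries" managed node group, instead of the cluster IAM role.
resource "aws_iam_instance_profile" "karpenter" {
  name = "KarpenterNodeInstanceProfile-${local.cluster_name}"
  role = module.eks.eks_managed_node_groups["sentries"].iam_role_name
}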

@pratikbin
Author

FYI - working example that you can refer to @pratikbin https://github.com/clowdhaus/eks-reference-architecture/tree/main/karpenter

@bryantbiggs It worked. Now I'll compare yours with mine.

Here is an example of just deploying an EKS managed node group with a single node
So with that node group, an IAM role is created by the module and we can reference that to create an instance profile for Karpenter https://github.com/terraform-aws-modules/terraform-aws-eks/blob/3ff17205a4ead51cca993547ef3de42cc080043b/examples/karpenter/main.tf#L119-L122

Yes, I know that. I bet it was some kind of permission issue, but I still have to find out which permissions.

@dheeraj-incred

@pratikbin were you able to figure this out?
I'm trying to add Karpenter to an existing EKS cluster, but new nodes created by Karpenter aren't joining the cluster due to the same lease error.

@pratikbin
Author

@dheeraj-incred No, I haven't dug in deep, but @bryantbiggs's Terraform worked for me. He's using a module for Karpenter.

@vumdao

vumdao commented Apr 26, 2022

I faced this issue and fixed it. In my case, the Karpenter node was not allowed to send traffic to the ECR service endpoint, which is placed in a private network.

How did I figure out the root cause? By referring to Troubleshoot Karpenter Node.
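
For anyone hitting the same thing, here is a rough Terraform sketch of the VPC endpoints that nodes in private subnets typically need in order to pull images from ECR; the region, module outputs, and security group referenced below are assumptions based on the config earlier in this issue:

# Interface endpoints so nodes in private subnets can reach ECR without a NAT path.
resource "aws_vpc_endpoint" "ecr" {
  for_each = toset([
    "com.amazonaws.ap-south-1.ecr.api",
    "com.amazonaws.ap-south-1.ecr.dkr",
  ])

  vpc_id              = module.vpc.vpc_id
  service_name        = each.value
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [module.eks.node_security_group_id] # assumed SG; must allow 443 from the nodes
  private_dns_enabled = true
}

# ECR image layers are served from S3, so a gateway endpoint is needed as well.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.ap-south-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}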

@dmitry-mightydevops

I had exactly the same issue, but my problem was that I mixed Karpenter with managed EKS node groups (intentionally), and my tag value for Provisioner.spec.provider.securityGroupSelector was selecting only a subset of the security groups. As a result, the new node had NO security group that allowed communication with the cluster's control plane. Once I fixed it, everything worked like a charm!
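
One way to make sure the node security group carries the tag that securityGroupSelector matches is to tag it explicitly. A small sketch, assuming the v18 module's node_security_group_id output and the discovery tag used earlier in this issue:

# Tag the EKS-managed node security group so the Provisioner's
# securityGroupSelector (karpenter.sh/discovery = eks-test) can find it.
resource "aws_ec2_tag" "karpenter_discovery" {
  resource_id = module.eks.node_security_group_id
  key         = "karpenter.sh/discovery"
  value       = local.cluster_name
}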

@dewjam
Contributor

dewjam commented May 10, 2022

Hey all, it looks like this issue is now resolved. I'm going to close it out, but feel free to reopen if you're still having issues.

Thanks!

@dewjam dewjam closed this as completed May 10, 2022
@IronforgeV

FYI - working example that you can refer to @pratikbin https://github.com/clowdhaus/eks-reference-architecture/tree/main/karpenter

tags = merge(module.tags.tags, {
  # This will tag the launch template created for use by Karpenter
  "karpenter.sh/discovery/${local.name}" = local.name
})

This should never have worked in your example, because the tags are applied outside the launch template and at the cluster level; they need to be inside the eks_managed_node_groups block.
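
For illustration only, this is roughly what moving the tag inside the eks_managed_node_groups block would look like against the config from this issue (a sketch, not a verified fix):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.18.0"

  cluster_version = "1.21"
  cluster_name    = local.cluster_name
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
  enable_irsa     = true

  eks_managed_node_groups = {
    sentries = {
      min_size     = 1
      max_size     = 1
      desired_size = 1

      instance_types = ["c5.xlarge"]
      capacity_type  = "ON_DEMAND"

      # Tags here are applied to the node group's resources, including its
      # launch template, rather than at the cluster level.
      tags = {
        "karpenter.sh/discovery" = local.cluster_name
      }
    }
  }
}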

@michaelswierszcz

I was running a custom AMI and had this error; my fix was changing

amiFamily: Custom

to

amiFamily: AL2

@hawkesn
Contributor

hawkesn commented Nov 2, 2022

Commenting on this for posterity.

You can also run into this issue if you (accidentally) set the clusterEndpoint in the Helm chart to one cluster and the clusterName to another cluster; kube-proxy can then fail to connect.

The Troubleshooting guide helped me dig into the Kubelet logs and I found:

Failed to contact API server when waiting for CSINode publishing: Get "https://xxxxxxxxx.gr7.us-east-1.eks.amazonaws.com/apis/storage.k8s.io/v1/csinodes/ip-xxxxxxxx.ec2.internal": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"

The cert error is because it's trying to connect with the wrong certificate.

Totally didn't happen to me 😄

@anapsix

anapsix commented May 31, 2023

== heads-up ==

Another "operator error" type issue can occur when one has a Karpenter provisioner with containerRuntime: dockerd runtime, while the cluster have been upgraded to K8s v1.24+, which removed support for dockershim (unless custom AMIs are used).
In this scenario, the kubelet fails to start, and nodes fail to register (which becomes evident after checking kubelet logs on EKS node). Updating provisioner config to use the containerd runtime resolves an issue.
