aws_eks_addon creation race condition #20404

Closed
paulgear opened this issue Aug 2, 2021 · 9 comments · Fixed by #20562
Labels
bug Addresses a defect in current functionality. service/eks Issues and PRs that pertain to the eks service.

paulgear commented Aug 2, 2021

Description

When created too soon after the EKS cluster (presumably before or during nodegroup creation), the aws_eks_addon resource doesn't always create correctly.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v1.0.0
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.52.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

  • aws_eks_addon

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

resource "aws_eks_cluster" "main" {
  name    = var.cluster_name
  version = var.cluster_version

  role_arn = aws_iam_role.cluster.arn
  ...
  tags = var.tags

  depends_on = [
    aws_cloudwatch_log_group.main,
    aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,
    aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
  ]
}

resource "aws_eks_addon" "coredns" {
  # aws_eks_cluster exports `name`; it has no `cluster_name` attribute
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "coredns"
}

Debug Output

This will be provided later if needed, once I've redacted it sufficiently.

Panic Output

n/a

Expected Behavior

Degraded seems to be a fairly normal state for initial creation of EKS add-ons when the cluster is fairly new. The provider should wait long enough for the add-on to transition from degraded to active.

Actual Behavior

Error when applying initial configuration:

Error: unexpected EKS Add-On (CLUSTERNAME:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s(<nil>)

A second apply works fine.

Steps to Reproduce

  1. terraform apply

Important Factoids

  • Adding a manual dependency on the nodegroup resource avoids this race.
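
A minimal sketch of that manual dependency, assuming the worker nodes are a managed node group declared as `aws_eks_node_group.main` (a hypothetical name; the configuration above does not show the node group resource):

```hcl
resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "coredns"

  # Hypothetical resource name: delay add-on creation until the node group
  # exists, so the CoreDNS pods have somewhere to schedule.
  depends_on = [aws_eks_node_group.main]
}
```

This only orders resource creation. Later comments in this thread report mixed results with the approach.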

References

@github-actions github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/eks Issues and PRs that pertain to the eks service. labels Aug 2, 2021
wcarlsen commented Aug 3, 2021

@paulgear we see this issue too, but adding the manual dependency on the nodegroup resource doesn't work for us. Do you have any more insights?

Author

paulgear commented Aug 3, 2021

@wcarlsen Maybe try a cluster readiness check like this? https://github.com/cmdlabs/cmd-tf-aws-eks/blob/master/cluster/auth-cm.tf#L22
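
The linked file is not reproduced here. As a rough sketch of what such a readiness check can look like, assuming `curl` is available where Terraform runs and that the cluster endpoint is reachable from there and answers unauthenticated requests to `/healthz` (both are assumptions), a `null_resource` could poll the API server and the add-on could depend on it. The resource name `wait_for_cluster` is hypothetical.

```hcl
resource "null_resource" "wait_for_cluster" {
  # Re-run the check if the cluster endpoint changes.
  triggers = {
    endpoint = aws_eks_cluster.main.endpoint
  }

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    environment = {
      ENDPOINT = aws_eks_cluster.main.endpoint
    }
    # Poll up to 60 times, 5 seconds apart. -k skips TLS verification
    # because the cluster CA is not in the local trust store.
    command = "for i in $(seq 1 60); do curl -k -s -f -o /dev/null \"$ENDPOINT/healthz\" && exit 0; sleep 5; done; echo TIMEOUT; exit 1"
  }
}

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "coredns"

  depends_on = [null_resource.wait_for_cluster]
}
```

Note that this only verifies that the API server responds. It says nothing about whether worker nodes are available to run the add-on's pods.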

z0rc commented Aug 3, 2021

Actually, it's quite possible to create an EKS cluster with add-ons but without any workers; at least the AWS web console does this when creating a cluster. Obviously the coredns deployment will be in a degraded state until some worker nodes are available.

I think there are two ways to handle this:

  • In terraform code by adding something like depends_on = [aws_eks_node_group.workers] to coredns aws_eks_addon resource.
  • In provider by allowing DEGRADED state in resource configuration or extending error handling.

wcarlsen commented Aug 3, 2021

Thanks for the input @paulgear, but I still didn't manage to get it working. I also tried @z0rc's suggestion of adding a dependency between the node group workers and the CoreDNS add-on, without any luck. I guess we will have to wait for the provider-side fix and use the good old "double apply" trick in the meantime.

tkjwa commented Aug 3, 2021

I've been having the same issue since yesterday. I had a working run before the weekend.
From my TF Cloud run history on July 30th 2021, 3:30:40 pm:

...
aws_eks_addon.k8s_vpc_addon: Creating...
aws_eks_addon.k8s_vpc_addon: Creation complete after 3s [id=platform-staging:vpc-cni]
aws_eks_addon.k8s_proxy_addon: Creation complete after 6s [id=platform-staging:kube-proxy]
aws_eks_addon.k8s_coredns_addon: Still creating... [10s elapsed]
aws_eks_addon.k8s_coredns_addon: Creation complete after 17s [id=platform-staging:coredns]
aws_eks_node_group.node_group: Creating...
aws_eks_node_group.node_group: Still creating... [10s elapsed]
aws_eks_node_group.node_group: Still creating... [20s elapsed]
aws_eks_node_group.node_group: Still creating... [30s elapsed]
...
aws_eks_node_group.node_group: Creation complete after 3m7s [id=platform-staging:platform-node-group-staging]

Apply complete! Resources: 14 added, 0 changed, 0 destroyed.

We can see that the add-ons were created before the node group without any error. Since yesterday, I also get the following:

unexpected EKS Add-On (platform-staging:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s(<nil>)

If I add a dependency on the node group to the add-on definition, the creation goes fine, but I end up with some ENIs and SGs left over after the cluster deletion :(

ghost commented Aug 5, 2021

I'm also having the same issue, but with a slightly different use case.

Last week I was able to provision the add-on and then patch the deployment to run on Fargate (https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html#fargate-gs-coredns). Unfortunately, this is no longer possible due to the following error:

unexpected EKS Add-On (example:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s(<nil>)

I've also been able to replicate this with older versions of the provider (such as v3.47.0).

Edit: In this case the EKS cluster is Fargate-only, with no node groups.

@ewbankkit ewbankkit added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Aug 5, 2021
abstrask added a commit to dfds/infrastructure-modules that referenced this issue Aug 9, 2021
abstrask commented Aug 9, 2021

We use un-managed node groups (aka. plain auto-scaling groups) controlled by one Terraform module, and manage EKS add-ons through another module. This workaround seems to do the trick:

In our main EKS cluster module:

module "eks_addons" {
  source = "../../_sub/compute/eks-addons"
  depends_on = [
    module.eks_cluster,
    module.eks_nodegroup1_workers,
    module.eks_nodegroup2_workers
  ] # added explicit dependencies on node group modules, as a workaround to dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404

  ...
}

In our un-managed node group sub-module:

resource "aws_autoscaling_group" "eks" {
  ...

  provisioner "local-exec" {
    command = "sleep 60" # added arbitrary delay to allow ASG to spin up instances, as a workaround to dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404
  }
}

See also dfds/infrastructure-modules#276.
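
An alternative to the `local-exec` sleep, sketched here under assumptions: the `time_sleep` resource from the hashicorp/time provider (not part of the setup described above) gives the same fixed delay without shelling out. The resource name `wait_for_nodes` is hypothetical.

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

# Wait a fixed 60 seconds after the auto-scaling group has been created.
resource "time_sleep" "wait_for_nodes" {
  depends_on      = [aws_autoscaling_group.eks]
  create_duration = "60s"
}
```

Placed inside the node group sub-module, the delay is covered by the existing module-level `depends_on`. Like the original `sleep 60`, it is still an arbitrary delay and not a readiness check.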

abstrask pushed a commit to dfds/infrastructure-modules that referenced this issue Aug 10, 2021
avnes added a commit to dfds/infrastructure-modules that referenced this issue Aug 12, 2021
* Workaround/fix for dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404

* Re-enable QA destroy steps

Co-authored-by: abstrask
Co-authored-by: Rasmus Rask
@github-actions github-actions bot added this to the v3.55.0 milestone Aug 18, 2021
@github-actions

This functionality has been released in v3.55.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
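
To make sure a configuration picks up a release that contains the fix, the provider version can be constrained. A minimal sketch (the constraint style is a choice, not something prescribed in this thread):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # v3.55.0 is the first release that includes the fix for this issue.
      version = ">= 3.55.0"
    }
  }
}
```

After changing the constraint, running `terraform init -upgrade` selects a matching provider version.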

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2021
6 participants