OIDC provider thumbprints inconsistent and sometimes incorrect #2768

Closed
2 tasks done
danielfrankcom opened this issue Oct 6, 2023 · 14 comments · Fixed by #2778

@danielfrankcom
Contributor

Description

When I run terraform plan, the thumbprint_list of the included oidc_provider changes regularly, even when run back-to-back with no code changes. Additionally, the correct root thumbprint is not always included.

  • ✋ I have searched the open/closed issues and my issue is not listed.
  • ✋ I cleared the .terraform cache and am able to reproduce the issue with a fresh init/apply.

Versions

  • Module version:

    • eks: 19.16.0
    • vpc: 5.0.0
  • Terraform & provider version(s):

$ terraform providers -version
Terraform v1.6.0
on windows_amd64
+ provider registry.terraform.io/hashicorp/aws v5.19.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.23.0
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code

provider "aws" {
  region  = "ca-central-1"
  profile = "sandbox"
}

data "aws_availability_zones" "available" {}

locals {
  vpc_cidr = "10.0.0.0/16"
  azs      = slice(data.aws_availability_zones.available.names, 0, 3)
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "bug-report"

  azs             = local.azs
  private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)]
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.16.0"

  cluster_name    = "bug-report"
  cluster_version = "1.27"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}

output "fingerprint" {
  value = module.eks.cluster_tls_certificate_sha1_fingerprint
}

Steps to reproduce the behavior:

  1. terraform init
  2. terraform apply
  3. terraform plan (repeat until Terraform shows changes, usually happens within 2-3 attempts)

Expected behavior

No changes. Your infrastructure matches the configuration.

I also expect the root certificate for my region (9e99a48a9960b14926bb7f3b02e22da2b0ab7280) to be in the thumbprint list for the provider.

Actual behavior

terraform apply created a provider with 3 certificates:

  • "c2f78cf04b914dd263be010c902dab7c8b3d09c8"
  • "7a6fd3d1bd08cc771fc67335f276c245e11d7dbb"
  • "be73080d116f7841025c4c16a1702a6b9b40413e"

The root certificate is missing here, and attempting to use a service account with this provider will fail unless "9e99a48a9960b14926bb7f3b02e22da2b0ab7280" is included.

After changing nothing, I ran terraform plan. The first time there were no necessary changes.

The second time terraform plan was run, all of the certificates were changed, and the root certificate is now included:

Terraform will perform the following actions:

  # module.eks.aws_iam_openid_connect_provider.oidc_provider[0] will be updated in-place
  ~ resource "aws_iam_openid_connect_provider" "oidc_provider" {
        id              = "arn:aws:iam::******"
        tags            = {
            "Name" = "bug-report-eks-irsa"
        }
      ~ thumbprint_list = [
          - "c2f78cf04b914dd263be010c902dab7c8b3d09c8",
          - "7a6fd3d1bd08cc771fc67335f276c245e11d7dbb",
          - "be73080d116f7841025c4c16a1702a6b9b40413e",
          + "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
          + "06b25927c42a721631c1efd9431e648fa62e1e39",
          + "2ad974a775f73cbdbbd8f5ac3a49255fa8fb1f8c",
          + "394b2cdabaf21bd23cafa6d3b450b993ff3bbc4f",
        ]
        # (4 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Changes to Outputs:
  + fingerprint = "9e99a48a9960b14926bb7f3b02e22da2b0ab7280"

I did not apply this change by typing "no".

I ran terraform plan again, and the output is inconsistent again. This time the cluster_tls_certificate_sha1_fingerprint value changed, as did the proposed changes to the thumbprint_list.

I have tried using the "c2f78cf04b914dd263be010c902dab7c8b3d09c8" certificate but the cluster service accounts don't work. When I manually retrieve the thumbprint for the server, I get "9e99a48a9960b14926bb7f3b02e22da2b0ab7280", which is the only one that works for my service accounts.

Terraform will perform the following actions:

  # module.eks.aws_iam_openid_connect_provider.oidc_provider[0] will be updated in-place
  ~ resource "aws_iam_openid_connect_provider" "oidc_provider" {
        id              = "arn:aws:iam::******"
        tags            = {
            "Name" = "bug-report-eks-irsa"
        }
      ~ thumbprint_list = [
            # (1 unchanged element hidden)
            "7a6fd3d1bd08cc771fc67335f276c245e11d7dbb",
          - "be73080d116f7841025c4c16a1702a6b9b40413e",
          + "842d9303407fb4d818455bc344019ab88ff1f092",
        ]
        # (4 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Changes to Outputs:
  + fingerprint = "c2f78cf04b914dd263be010c902dab7c8b3d09c8"

The planning behavior isn't consistent. Another time, when I cleared out the cache and re-applied, the initial deployment used the correct root certificate and then later went back to the "c2f78..." one. The root certificate seems to flip-flop between "9e99a..." and "c2f78...", while the other certificates vary from run to run.

Additional context

I'm not sure if it's relevant, but I am in a corporate environment with a custom root CA. Since there don't seem to be other similar issues, I suspect it may be involved in the root cause, although I'm not sure why, since the corporate certificate is consistent and doesn't cause problems for other applications.

I recorded DEBUG logs for this scenario, and according to the logs the "9e99a48a9960b14926bb7f3b02e22da2b0ab7280" certificate is still detected as the thumbprint for the server, even when it is not included in the final thumbprint_list. That is odd and looks like a logic bug somewhere.

@bryantbiggs
Member

see #2732

@danielfrankcom
Contributor Author

Hi @bryantbiggs I don't think #2732 will resolve my issue.

Please note that the CA fingerprint in my report is not consistent when I use terraform apply, so there would still be changes even if the CA fingerprint was the only one that was included.

I will create a PR to resolve #2732 since it will make my problem easier to see (without the churn of the other fingerprints), but the real issue here is the CA fingerprint inconsistency.

@bryantbiggs
Member

I don't believe the root CA will be changing, or rather, not changing that often. These are also not something that we can control here; that's up to EKS in terms of when they rotate certs.

@danielfrankcom
Contributor Author

I don't think the issue is the root CA is changing, I think the issue is that Terraform thinks it is changing. There seems to be some logic bug where the root CA thumbprint is not always included in the thumbprint_list, despite being detected correctly.

In the output that I provided with my report, you can see that the real CA thumbprint is "9e99a48a9960b14926bb7f3b02e22da2b0ab7280", yet intermittently Terraform does not include this in the thumbprint_list.

As mentioned at the bottom of my report, the debug logs show the correct "9e99a..." thumbprint for the server, even though the thumbprint does not make its way into the thumbprint_list for some reason.

@danielfrankcom
Contributor Author

I have been able to reproduce the same behavior by using the individual resources from the module, so this may be an issue with one of the components rather than this module itself.

With the code below, I saw the correct thumbprints applied initially, but then the same incorrect thumbprints intermittently:

Changes to Outputs:
  + cluster_tls_certificate_sha1_fingerprint = "c2f78cf04b914dd263be010c902dab7c8b3d09c8"
  ~ fingerprints                             = [
      - "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
      - "06b25927c42a721631c1efd9431e648fa62e1e39",
      - "2ad974a775f73cbdbbd8f5ac3a49255fa8fb1f8c",
      - "394b2cdabaf21bd23cafa6d3b450b993ff3bbc4f",
      + "c2f78cf04b914dd263be010c902dab7c8b3d09c8",
      + "7a6fd3d1bd08cc771fc67335f276c245e11d7dbb",
      + "ed51abf0bfec9951a24e04d28c3a7ee3abf6c592",
    ]

Note the missing "9e99a48a9960b14926bb7f3b02e22da2b0ab7280" root CA thumbprint.

provider "aws" {
  region  = "ca-central-1"
  profile = "sandbox"
}

resource "aws_iam_role" "demo-cluster" {
  name = "terraform-eks-demo-cluster"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}


variable "cluster_name" {
  default = "terraform-eks-demo"
  type    = string
}

data "aws_availability_zones" "available" {}

resource "aws_vpc" "demo" {
  cidr_block = "10.0.0.0/16"

  tags = tomap({
    "Name"                                      = "terraform-eks-demo-node",
    "kubernetes.io/cluster/${var.cluster_name}" = "shared",
  })
}

resource "aws_subnet" "demo" {
  count = 2

  availability_zone       = data.aws_availability_zones.available.names[count.index]
  cidr_block              = "10.0.${count.index}.0/24"
  vpc_id                  = aws_vpc.demo.id
}


resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.demo-cluster.arn

  vpc_config {
    subnet_ids         = aws_subnet.demo[*].id
  }
}

data "tls_certificate" "this" {
  url = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

output "fingerprints" {
  value = data.tls_certificate.this.certificates[*].sha1_fingerprint
}

output "cluster_tls_certificate_sha1_fingerprint" {
  description = "The SHA1 fingerprint of the public key of the cluster's certificate"
  value       = data.tls_certificate.this.certificates[0].sha1_fingerprint
}

@danielfrankcom
Contributor Author

After some more digging, it seems like the fingerprint churn is in fact caused by the corporate root CA that I mentioned.

Findings

Our network has a Zscaler instance running which intercepts HTTPS traffic for inspection. The traffic is then re-encrypted with a certificate chain signed by the Zscaler root CA, which the corporate machines are configured to trust.

I ran the following command to continually monitor the certificate chain across multiple requests:

while true; do openssl s_client -showcerts -verify 5 -connect oidc.eks.ca-central-1.amazonaws.com:443 < /dev/null; sleep 1; done

I found that intermittently the Zscaler certificate chain would be returned by this command, which I think is causing the issue with the fingerprints in my EKS cluster.

When I manually generate the fingerprint of the Zscaler root CA, it comes out to "c2f78cf04b914dd263be010c902dab7c8b3d09c8", which was the problem fingerprint from my original report. Additionally, the Zscaler certificate chain contains 3 certificates, as opposed to the AWS chain which contains 4, hence the differences in the other certificates from the original bug report.
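One way to make this interception visible during planning is a Terraform `check` block (available since Terraform 1.5, so the v1.6.0 used here supports it) that compares the root fingerprint returned by the `tls_certificate` data source with the expected AWS root. This is a minimal sketch, assuming the non-counted `data.tls_certificate.this` from the standalone reproduction earlier in this thread and pinning the "9e99a..." value observed for ca-central-1; the local name is hypothetical. A failing `check` assertion is reported as a warning rather than an error, so it flags the problem without blocking the plan.

```hcl
locals {
  # Hypothetical local: the AWS root CA thumbprint observed for ca-central-1 in this thread.
  expected_oidc_root_thumbprint = "9e99a48a9960b14926bb7f3b02e22da2b0ab7280"
}

# Warn when the root of the presented chain is not the expected AWS root,
# e.g. because a TLS inspection proxy such as Zscaler re-signed the connection.
check "oidc_root_ca_thumbprint" {
  assert {
    # The tls_certificate data source lists the root of the chain first.
    condition     = data.tls_certificate.this.certificates[0].sha1_fingerprint == local.expected_oidc_root_thumbprint
    error_message = "Unexpected OIDC root CA thumbprint ${data.tls_certificate.this.certificates[0].sha1_fingerprint}; the TLS connection may have been intercepted."
  }
}
```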

Next steps

Now I know what is causing the problem, but I don't have a clear path to solving it.

I'm not sure there is a way for the TLS provider to resolve the actual root CA consistently, since Zscaler is intercepting the traffic and messing with the certificates.

I can manually configure the correct fingerprint for the OIDC provider, but the intermittent churn in the automatic module certificates will still occur.
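For the standalone resources from the reproduction above, manually pinning the thumbprint could look roughly like this. It is a sketch, assuming the `aws_eks_cluster.this` resource from that reproduction and the usual `sts.amazonaws.com` audience for IRSA; the thumbprint is hard-coded so it no longer depends on what the network returns at plan time.

```hcl
# Sketch: OIDC provider with a pinned thumbprint instead of one read over a
# possibly intercepted connection on every plan.
resource "aws_iam_openid_connect_provider" "this" {
  url            = aws_eks_cluster.this.identity[0].oidc[0].issuer
  client_id_list = ["sts.amazonaws.com"]

  # AWS root CA thumbprint observed for ca-central-1 in this thread.
  thumbprint_list = ["9e99a48a9960b14926bb7f3b02e22da2b0ab7280"]
}
```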

Would it be reasonable to add an option to the module to override the fingerprints completely? Any attempt to automatically resolve the certificates in my use case will result in churn at planning/applying time from what I can tell.

@bryantbiggs bryantbiggs linked a pull request Oct 6, 2023 that will close this issue
@danielfrankcom
Contributor Author

danielfrankcom commented Oct 6, 2023

I'm happy to put together a pull request if something like this would be acceptable:

override_oidc_thumbprints = true
custom_oidc_thumbprints   = [
  "9e99a48a9960b14926bb7f3b02e22da2b0ab7280"
]

Where the thumbprint_list would be configured according to these rules:

# If override_oidc_thumbprints=true then
thumbprint_list = var.custom_oidc_thumbprints

# If override_oidc_thumbprints=false then
thumbprint_list = concat(data.tls_certificate.this[0].certificates[*].sha1_fingerprint, var.custom_oidc_thumbprints)

This UX is my first thought, but there may be clearer approaches.
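Inside the module, the proposed rules could collapse into a single conditional expression. This is a sketch, not the module's actual code: the variable names come from the proposal above, `override_oidc_thumbprints = true` is assumed to mean the custom list replaces the detected fingerprints entirely, and `data.tls_certificate.this[0]` is assumed to be the module's counted data source. The data source would still be read on every plan, but its result would no longer reach `thumbprint_list` when overriding.

```hcl
# Hypothetical inputs taken from the proposal above.
variable "override_oidc_thumbprints" {
  description = "If true, use only custom_oidc_thumbprints instead of the detected certificate chain"
  type        = bool
  default     = false
}

variable "custom_oidc_thumbprints" {
  description = "Additional (or, when overriding, replacement) OIDC provider thumbprints"
  type        = list(string)
  default     = []
}

locals {
  # Override: custom list only. Otherwise: detected chain plus any custom entries.
  oidc_thumbprint_list = var.override_oidc_thumbprints ? var.custom_oidc_thumbprints : concat(
    data.tls_certificate.this[0].certificates[*].sha1_fingerprint,
    var.custom_oidc_thumbprints,
  )
}
```

The provider resource would then set `thumbprint_list = local.oidc_thumbprint_list`.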

@danielfrankcom
Contributor Author

danielfrankcom commented Oct 6, 2023

I suppose since #2769 is merged, there is only 1 certificate to configure. That might make the wording of the configuration a bit easier.

Something like this maybe?

include_oidc_ca_thumbprint = false
custom_oidc_thumbprints    = [
  "9e99a48a9960b14926bb7f3b02e22da2b0ab7280"
]
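With only the root CA fingerprint left to configure, the second proposal might be wired up as below. Again a sketch: `include_oidc_ca_thumbprint` and `custom_oidc_thumbprints` are the names suggested above (not confirmed module inputs), and `data.tls_certificate.this[0]` is assumed to be the module's counted data source, whose `certificates[0]` is the root of the presented chain.

```hcl
# Hypothetical inputs taken from the proposal above.
variable "include_oidc_ca_thumbprint" {
  type    = bool
  default = true
}

variable "custom_oidc_thumbprints" {
  type    = list(string)
  default = []
}

locals {
  # Root CA fingerprint detected at plan time, or nothing when the caller opts out.
  oidc_ca_thumbprint = var.include_oidc_ca_thumbprint ? [data.tls_certificate.this[0].certificates[0].sha1_fingerprint] : []

  # Setting include_oidc_ca_thumbprint = false leaves only the pinned values,
  # so an intercepted connection can no longer cause plan churn.
  oidc_thumbprint_list = concat(local.oidc_ca_thumbprint, var.custom_oidc_thumbprints)
}
```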


github-actions bot commented Nov 6, 2023

This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label Nov 6, 2023
@danielfrankcom
Contributor Author

This is still an issue, #2778 is waiting for review and should resolve this.

@github-actions github-actions bot removed the stale label Nov 7, 2023
@globalpayments-martinvanderboon

As a fellow Zscaler user with this exact issue, I would like to see #2778 merged as well.

@antonbabenko
Member

This issue has been resolved in version 19.20.0 🎉

@parvez99

parvez99 commented Nov 20, 2023

I'm still seeing this issue when running terraform plan with the latest terraform-aws-eks module version: 19.20.0

Terraform version: Terraform v1.5.6
EKS version: 1.28
Platform Version: eks-3
Module Version: 19.20.0

(EKS platform version was automatically updated by EKS in the background a few days back)

-- Output I see on running terraform plan

  # module.eks.module.eks.aws_iam_openid_connect_provider.oidc_provider[0] will be updated in-place
  ~ resource "aws_iam_openid_connect_provider" "oidc_provider" {
        id              = "arn:aws:iam::redacted"
      ~ tags            = {
            "Cluster" = "test"
          ~ "Contact" = "redacted"
        }
      ~ tags_all        = {
          ~ "Contact" = "redacted" -> "redacted"
            # (8 unchanged elements hidden)
        }
      ~ thumbprint_list = [
          - "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
          - "06b25927c42a721631c1efd9431e648fa62e1e39",
          - "414a2060b738c635cc7fc243e052615592830c53",
          - "aaa68bb211d468db8a8a19561ccba2e4043dcc80",
        ] -> (known after apply)
        # (3 unchanged attributes hidden)
    }


I tried clearing the .terraform modules, but I still see the thumbprint_list change when running terraform plan.


I manually checked the thumbprint of the cluster, and it appears to be the same value listed in the terraform plan output:

$> printf '{"thumbprint": "%s"}\n' $THUMBPRINT
{"thumbprint": "9E99A48A9960B14926BB7F3B02E22DA2B0AB7280"}

These certificate details also look valid:

            "is_ca": true,
            "issuer": "OU=Starfield Class 2 Certification Authority,O=Starfield Technologies\\, Inc.,C=US",
            "not_after": "2034-06-28T17:39:16Z",
            "not_before": "2009-09-02T00:00:00Z",
            "public_key_algorithm": "RSA",
            "serial_number": "redacted",
            "sha1_fingerprint": "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
            "signature_algorithm": "SHA256-RSA",
            "subject": "CN=Starfield Services Root Certificate Authority - G2,O=Starfield Technologies\\, Inc.,L=Scottsdale,ST=Arizona,C=US",
            "version": 3

Am I missing something? Is it safe to go ahead and run terraform apply anyway?
I'm not sure whether this could break something, so any input would be appreciated, thanks.


I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2023

5 participants