Upgrade to v1.0.4 breaks kubelet config using EC2NodeClass's spec.kubelet.clusterDNS #7235

Open
schahal opened this issue Oct 16, 2024 · 9 comments
Labels
question Further information is requested

Comments

@schahal

schahal commented Oct 16, 2024

Description

Observed Behavior:

After upgrading from karpenter-provider-aws:v1.0.3 to v1.0.4, the new Kubernetes nodes that Karpenter provisions do not have the EC2NodeClass's spec.kubelet.clusterDNS value in their /etc/kubernetes/kubelet/config.

For example, one of our EC2NodeClasses (e.g., default-ebs) looks like this:

Click to view EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "<redacted>"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 18
  name: default-ebs
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
  - alias: [email protected]
  ...
  kubelet:
    clusterDNS:
    - 169.254.20.11
    evictionHard:
      imagefs.available: 30%
      imagefs.inodesFree: 15%
      ...
    evictionSoft:
      ...

Notice, spec.kubelet.clusterDNS: [169.254.20.11]

However, on the new node, we see:

$ grep -A1 "clusterDNS" /.bottlerocket/rootfs/etc/kubernetes/kubelet/config
clusterDNS:
- 172.20.0.10
Also confirmed that the NodePool the node belongs to references the same EC2NodeClass as above (a one-line check is sketched below the snippet):
      Node Class Ref:
        Group:  karpenter.k8s.aws
        Kind:   EC2NodeClass
        Name:   default-ebs
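
For reference, a quick way to check which EC2NodeClass a NodePool points at (a sketch with a placeholder NodePool name; the field path follows the karpenter.sh/v1 NodePool schema):

$ kubectl get nodepool <nodepool-name> -o jsonpath='{.spec.template.spec.nodeClassRef.name}'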

Expected Behavior:

After reverting back to v1.0.3, we see the correct (expected) value on a new node:

$ grep -A1 "clusterDNS" /.bottlerocket/rootfs/etc/kubernetes/kubelet/config
clusterDNS:
- 169.254.20.11

Reproduction Steps (Please include YAML):

See above:

  1. Run karpenter-provider-aws:v1.0.3
  2. Make sure to have an EC2NodeClass which includes a custom spec.kubelet.clusterDNS
  3. Confirm /etc/kubernetes/kubelet/config on node has that same value
  4. Upgrade to karpenter-provider-aws:v1.0.4
  5. Repeat steps (2)-(3) and you'll see that the config file on the new node does not contain that value (a verification sketch follows this list)
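
A minimal verification sketch (assuming the default-ebs EC2NodeClass from above and shell access to the node, e.g. via the Bottlerocket admin container):

# What the EC2NodeClass requests:
$ kubectl get ec2nodeclass default-ebs -o jsonpath='{.spec.kubelet.clusterDNS[*]}'

# What the launched node actually received:
$ grep -A1 "clusterDNS" /.bottlerocket/rootfs/etc/kubernetes/kubelet/config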

Versions:

  • Chart Image Version: v1.0.4
  • Kubernetes Version (kubectl version): Server Version: v1.30.4-eks-a737599
  • Node version: bottlerocket v1.24.1
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@schahal schahal added bug Something isn't working needs-triage Issues that need to be triaged labels Oct 16, 2024
@schahal
Author

schahal commented Oct 16, 2024

A bit more clarity:

Prior to v1.0.4, Karpenter was able to take spec.kubelet.clusterDNS from the EC2NodeClass and overlay that value onto the node's kubelet config.

With v1.0.4+, it's keeping the default value we pass into the node user-data:

[settings]
...
[settings.kubernetes]
...
cluster-dns-ip = '172.20.0.10'
max-pods = 29
...

So it's not just clusterDNS (e.g., maxPods is kept at 29 also).

Looking at v1.0.4 release notes, there was a maxPods-related commit (#7020), but does that vibe with the symptoms of this issue?

@engedaam
Contributor

Do you have the compatibility.karpenter.sh/v1beta1-kubelet-conversion annotation on any of your NodePools? The compatibility.karpenter.sh/v1beta1-kubelet-conversion NodePool annotation takes precedence over the EC2NodeClass kubelet configuration when launching nodes.
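
A quick way to check (a sketch; assumes jq is installed):

$ kubectl get nodepools -o json \
    | jq -r '.items[] | "\(.metadata.name)\t\(.metadata.annotations["compatibility.karpenter.sh/v1beta1-kubelet-conversion"] // "none")"'

This prints each NodePool name alongside the annotation value, or "none" if the annotation is absent.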

@schahal
Author

schahal commented Oct 17, 2024

Indeed, looking at the NodePools, they have annotations like this (not explicitly added by us, so probably added by Karpenter itself when we migrated to v1.0):

compatibility.karpenter.sh/v1beta1-kubelet-conversion: {"kubeReserved":{"cpu":"90m","ephemeral-storage":"1Gi","memory":"1465Mi"}}
compatibility.karpenter.sh/v1beta1-nodeclass-reference: {"name":"default-ebs"}

A couple of questions:

  • Is the suggestion here that we remove those annotations and retry the upgrade?
    • The values of those annotations seem like unrelated configs (?)
    • It also seems like the docs suggest we remove them during an eventual jump to v1.1.x (here we're only going to v1.0.4)
  • Wondering why it only breaks when we jump from v1.0.3 to v1.0.4 (and not, for example, when we jumped from v1.0.2 to v1.0.3)?

@engedaam
Contributor

engedaam commented Oct 17, 2024

Is the suggestion here that we remove those annotations and retry the upgrade?

  • The values of those annotations seem like unrelated configs (?)
  • It also seems like the docs suggest we remove them during an eventual jump to v1.1.x (here we're only going to v1.0.4)

The NodePool compatibility.karpenter.sh/v1beta1-kubelet-conversion annotation will take precedence over the EC2NodeClass kubelet configuration. At v1.1 the expectation is that customers will have removed the annotation.

The compatibility.karpenter.sh/v1beta1-kubelet-conversion NodePool annotation takes precedence over the EC2NodeClass Kubelet configuration when launching nodes. Remove the kubelet-configuration annotation (compatibility.karpenter.sh/v1beta1-kubelet-conversion) from your NodePools once you have migrated kubelet from the NodePool to the EC2NodeClass.

ref: https://karpenter.sh/docs/upgrading/v1-migration/#before-upgrading-to-v11
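
For example, removing the annotation from a single NodePool can be done with kubectl (a sketch with a placeholder NodePool name; the trailing dash removes the annotation):

$ kubectl annotate nodepool <nodepool-name> compatibility.karpenter.sh/v1beta1-kubelet-conversion-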

Wondering why it's only breaking when we jump from v1.0.3 to v1.0.4 (and not, for example, when we jumped from v1.0.2 to v1.0.3?

We had a bug where any new NodeClaim that was launched used the EC2NodeClass kubelet configuration without considering the kubelet compatibility annotation. The fix was merged in kubernetes-sigs/karpenter v1.0.2: kubernetes-sigs/karpenter#1667. Are you able to share nodes that were created on 1.0.3, along with their NodePool and EC2NodeClass kubelet configurations?

@engedaam engedaam added question Further information is requested and removed bug Something isn't working needs-triage Issues that need to be triaged labels Oct 17, 2024
@schahal
Author

schahal commented Oct 17, 2024

Are you able to share node that were created on 1.0.3 with their nodepools and ec2nodeclass kubelet configurations?

Yes, here's a slightly anonymized share of that:

$ kubectl describe nodeclaim foo-bar-9lmsn
Name:         foo-bar-9lmsn
Namespace:
Labels:       <redacted>
              karpenter.sh/nodepool=foo-bar
              <redacted>
Annotations:  compatibility.karpenter.k8s.aws/cluster-name-tagged: true
              compatibility.karpenter.k8s.aws/kubelet-drift-hash: <redacted>
              karpenter.k8s.aws/ec2nodeclass-hash: <redacted>
              karpenter.k8s.aws/ec2nodeclass-hash-version: v3
              karpenter.k8s.aws/tagged: true
              karpenter.sh/nodepool-hash: <redacted>
              karpenter.sh/nodepool-hash-version: v3
API Version:  karpenter.sh/v1
Kind:         NodeClaim
Metadata:
  Creation Timestamp:  2024-10-17T11:11:31Z
  Finalizers:
    karpenter.sh/termination
  Generate Name:  foo-bar-
  Generation:     1
  Owner References:
    API Version:           karpenter.sh/v1
    Block Owner Deletion:  true
    Kind:                  NodePool
    Name:                  foo-bar
    UID:                   <redacted>
  Resource Version:        <redacted>
  UID:                     <redacted>
Spec:
  Expire After:  720h
  Node Class Ref:
    Group:  karpenter.k8s.aws
    Kind:   EC2NodeClass
    Name:   default-foobar
  Requirements:
    Key:       karpenter.sh/nodepool
    Operator:  In
    Values:
      foo-bar
    <redacted>
  Resources:
    Requests:
      Cpu:     1235m
      Memory:  4523Mi
      Pods:    18
  Taints:
    <redacted>
$ kubectl describe nodepool foo-bar
Name:         foo-bar
Namespace:
Labels:       <redacted>
Annotations:  compatibility.karpenter.sh/v1beta1-kubelet-conversion: {"kubeReserved":{"cpu":"90m","ephemeral-storage":"1Gi","memory":"1465Mi"}}
              compatibility.karpenter.sh/v1beta1-nodeclass-reference: {"name":"default-foobar"}
              karpenter.sh/nodepool-hash: <redacted>
              karpenter.sh/nodepool-hash-version: v3
              karpenter.sh/stored-version-migrated: true
API Version:  karpenter.sh/v1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2024-07-25T00:20:04Z
  Generation:          10
  Resource Version:    <redacted>
  UID:                 <redacted>
Spec:
  Disruption:
    Budgets:
      Nodes:  5%
      Reasons:
        Drifted
      Nodes:               25%
    Consolidate After:     5m
    Consolidation Policy:  WhenEmptyOrUnderutilized
  Template:
    Metadata:
      Labels:
        <redacted>
    Spec:
      Expire After:  720h
      Node Class Ref:
        Group:  karpenter.k8s.aws
        Kind:   EC2NodeClass
        Name:   default-foobar
      Requirements:
        <redacted>
      Startup Taints:
      Taints:
        <redacted>
$ kubectl describe ec2nodeclass default-foobar
Name:         default-foobar
Namespace:
Labels:       <redacted>
Annotations:  karpenter.k8s.aws/ec2nodeclass-hash: <redacted>
              karpenter.k8s.aws/ec2nodeclass-hash-version: v3
              karpenter.sh/stored-version-migrated: true
API Version:  karpenter.k8s.aws/v1
Kind:         EC2NodeClass
Metadata:
  Creation Timestamp:  2024-07-21T22:03:32Z
  Finalizers:
    karpenter.k8s.aws/termination
  Generation:        18
  Resource Version:  <redacted>
  UID:               <redacted>
Spec:
  Ami Family:  Bottlerocket
  Ami Selector Terms:
    Alias:  [email protected]
  Block Device Mappings:
    Device Name:  /dev/xvda
    Ebs:
      <redacted>
  Instance Profile:           <redacted>
  Kubelet:
    Cluster DNS:
      169.254.20.11
    Eviction Hard:
      imagefs.available:              30%
      imagefs.inodesFree:             15%
      memory.available:               5%
      nodefs.available:               10%
      nodefs.inodesFree:              10%
      pid.available:                  30%
    Image GC High Threshold Percent:  75
    Image GC Low Threshold Percent:   45
    Max Pods:                         110
  <redacted>
  User Data:      [settings.kubernetes]
<redacted... no cluster-dns-ip stuff anywhere here>
[settings.kernel.sysctl]
<redacted>
[settings.bootstrap-containers.setup-runtime-storage]
<redacted>

To re-iterate the behavior for complete understanding:

  • The above node/claim (launched by v1.0.3) has the kubelet Cluster DNS value we specify in the EC2NodeClass above in its kubelet config
  • When we upgrade Karpenter to v1.0.4+, the nodes it launches won't have the Cluster DNS value we specify in the EC2NodeClass above in their kubelet config

@schahal
Author

schahal commented Oct 17, 2024

We had a bug where any new NodeClaim that was launched used the EC2NodeClass kubelet configuration without considering the kubelet compatibility annotation. The fix was merged in kubernetes-sigs/karpenter v1.0.2: kubernetes-sigs/karpenter#1667

I believe this is the cause (and why we only see the problem when we upgrade to karpenter-provider-aws:v1.0.4).

Looking at all commit differences between karpenter-provider-aws v1.0.3 and v1.0.4, it looks like the fix merged in kubernetes-sigs/karpenter#1667 was only pulled into karpenter-provider-aws:v1.0.4:

[Screenshot: commit diff between karpenter-provider-aws v1.0.3 and v1.0.4 showing the kubernetes-sigs/karpenter#1667 fix being pulled in]

If I'm reading that correctly, what does that mean? Do karpenter users who upgrade from karpenter-provider-aws:v1.0.3 to karpenter-provider-aws:v1.0.4 need to manually remove that kubelet compatibility annotation from all their NodePools (regardless of what value that annotation has)?

@engedaam
Contributor

If I'm reading that correctly, what does that mean? Do karpenter users who upgrade from karpenter-provider-aws:v1.0.3 to karpenter-provider-aws:v1.0.4 need to manually remove that kubelet compatibility annotation from all their NodePools (regardless of what value that annotation has)?

Yes, as mentioned in the upgrade guide, you will need to drop the kubelet compatibility annotation from all of your NodePools. However, make sure the kubelet configuration values are the same to avoid causing unexpected drift.

Kubelet Configuration: If you have multiple NodePools pointing to the same EC2NodeClass that have different kubeletConfigurations, then you have to manually add more EC2NodeClasses and point their NodePools to them. This will induce drift and you will have to roll your cluster. If you have multiple NodePools pointing to the same EC2NodeClass, but they have the same configuration, then you can proceed with the migration without having drift or having any additional NodePools or EC2NodeClasses configured.

ref: https://karpenter.sh/docs/upgrading/v1-migration/#before-upgrading-to-v11
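
A quick way to compare the two before dropping the annotation (a sketch; assumes jq is installed and uses the foo-bar / default-foobar names from this thread):

# Kubelet values currently coming from the compatibility annotation:
$ kubectl get nodepool foo-bar -o json \
    | jq -r '.metadata.annotations["compatibility.karpenter.sh/v1beta1-kubelet-conversion"]'

# Kubelet values defined on the EC2NodeClass:
$ kubectl get ec2nodeclass default-foobar -o json | jq '.spec.kubelet'

Per the guidance quoted above, these should line up before the annotation is removed; otherwise removing it changes the effective kubelet configuration for newly launched nodes.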

Apologies, I should have been clearer, since kubernetes-sigs/karpenter and karpenter-provider-aws don't share the same versions. kubernetes-sigs/karpenter v1.0.2 corresponds to karpenter-provider-aws v1.0.4.

@schahal
Author

schahal commented Oct 21, 2024

No problem, thanks for confirming that! (I think one area for improvement would be to call out in the karpenter-provider-aws release notes whenever the kubernetes-sigs/karpenter version is updated within the AWS provider, for better visibility.)

Also, another point of clarity: the docs (ref: https://karpenter.sh/docs/upgrading/v1-migration/#before-upgrading-to-v11) say we should remove the annotations before upgrading to v1.1 <-- does this actually mean v1.0.1 (I don't see anything at v1.1 just yet, right)?

Regardless, I'll remove the annotations this week and re-attempt an upgrade to v1.0.4+ ... will keep this ticket updated.
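
In case it helps anyone else, a sketch of dropping the annotation across all NodePools in one pass (same trailing-dash removal as the single-NodePool example above):

$ kubectl get nodepools -o name \
    | xargs -I{} kubectl annotate {} compatibility.karpenter.sh/v1beta1-kubelet-conversion-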

@schahal
Author

schahal commented Oct 24, 2024

I'll remove the annotations this week and re-attempt an upgrade to v1.0.4+ ... will keep this ticket updated.

This seems to have worked.

We can close this issue if you agree. (Still not sure whether the current docs saying we should remove the annotations before upgrading to v1.1 actually mean v1.0.1 instead?)
