
Pod still gets evicted even with "karpenter.sh/do-not-disrupt" annotation #1167

Closed
chenfeilee opened this issue Apr 4, 2024 · 6 comments · Fixed by #1180
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@chenfeilee

chenfeilee commented Apr 4, 2024

Description

Observed Behavior:
Pods with the "karpenter.sh/do-not-disrupt" annotation still get evicted. This seems to happen quite frequently, though not every time. Below is the number of such occurrences over the past 2 weeks:

[screenshot: occurrence counts over the past 2 weeks]

Some examples of Karpenter evicting pods just moments after reporting that the same pods could not be disrupted:

Apr 3, 2024 @ 16:10:41.000: Cannot disrupt Node: Pod "mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45" has "karpenter.sh/do-not-disrupt" annotation
Apr 3, 2024 @ 16:10:41.000: Cannot disrupt NodeClaim: Pod "mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45" has "karpenter.sh/do-not-disrupt" annotation
...
(Events from Datadog) Apr 3, 4:10:41 pm: Events from the Pod mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45 - Evicted pod - Events emitted by the karpenter
Apr 2, 2024 @ 13:09:04.000: Cannot disrupt Node: Pod "mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7" has "karpenter.sh/do-not-disrupt" annotation
Apr 2, 2024 @ 13:09:04.000: Cannot disrupt NodeClaim: Pod "mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7" has "karpenter.sh/do-not-disrupt" annotation
...
(Events from Datadog) Apr 2, 1:09:05 pm: Events from the Pod mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7 - Evicted pod - Events emitted by the karpenter

The implication is that our GitLab runner jobs get killed by these evictions, despite the runner pods having the "karpenter.sh/do-not-disrupt" annotation set, e.g.:
[screenshot]

Expected Behavior:
Pods with the "karpenter.sh/do-not-disrupt" annotation shouldn't be evicted.
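
For reference, this is roughly where the annotation sits on our runner pods (a minimal sketch with placeholder names, not the actual runner manifest):

apiVersion: v1
kind: Pod
metadata:
  name: example-runner                      # placeholder name
  annotations:
    karpenter.sh/do-not-disrupt: "true"     # annotation value must be a quoted string
spec:
  containers:
    - name: runner                          # placeholder container name
      image: example.registry/runner:latest # placeholder image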

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 0.35.2
  • Kubernetes Version (kubectl version): EKS v1.23.17-eks-508b6b3, v1.27.10-eks-508b6b3
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@chenfeilee chenfeilee added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2024
@njtran
Contributor

njtran commented Apr 5, 2024

/assign @jmdeal

@njtran njtran removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 5, 2024
@jmdeal
Member

jmdeal commented Apr 5, 2024

Are you able to share the full Karpenter logs and NodePool spec? It's possible that you're hitting the consolidation race condition called out in #651. This sequence seems likely:

  • Karpenter makes a consolidation decision and finishes its 15-second validation period. At this point, no do-not-disrupt pods are scheduled.
  • Karpenter checks for do-not-disrupt pods and begins the final validation simulation. This is the beginning of the race window.
  • do-not-disrupt pods are scheduled to the node; other parallel disruption runs emit the "Cannot disrupt Node" events.
  • The simulation finishes, the node is tainted, and Karpenter begins draining the node.

Are you able to check the scheduled vs evicted times for the pods so we can see if they line up with this theory?
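
For reference, the timestamps I have in mind are the pod's PodScheduled and DisruptionTarget conditions, something like the following (illustrative values only):

status:
  conditions:
    - type: PodScheduled                    # when the pod was bound to the node
      status: "True"
      lastTransitionTime: "2024-04-03T16:10:40Z"  # placeholder timestamp
    - type: DisruptionTarget                # set when the pod is evicted via the Eviction API
      status: "True"
      reason: EvictionByEvictionAPI
      lastTransitionTime: "2024-04-03T16:10:41Z"  # placeholder timestamp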

@TimoVink

TimoVink commented Apr 5, 2024

I may be running into the same thing. I'm not used to digging through k8s/Karpenter audit logs, but here is what I was able to dig up:

Pod Spec

NOTE: I manually removed some fields from this object for brevity and privacy. Let me know if I was overzealous in that respect.

  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      karpenter.sh/do-not-evict: true
    creationTimestamp: 2024-04-05T17:13:10.0000000Z
    finalizers:
      - batch.kubernetes.io/job-tracking
    generateName: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-
    labels:
      app.kubernetes.io/component: run_worker
      app.kubernetes.io/instance: dagster
      app.kubernetes.io/name: dagster
      app.kubernetes.io/part-of: dagster
      app.kubernetes.io/version: 1.6.11
      batch.kubernetes.io/controller-uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
      batch.kubernetes.io/job-name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
      controller-uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
      dagster/code-location: core
      dagster/job: ASSET_JOB_1
      dagster/run-id: 076c8f10-b1ec-4fcf-a81e-f7366fb60c94
      job-name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
    name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
    namespace: dagster
    ownerReferences:
      - apiVersion: batch/v1
        blockOwnerDeletion: true
        controller: true
        kind: Job
        name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
        uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
    resourceVersion: 299078016
    uid: 8cdd179a-c970-4600-87c6-46c51edf94da
  spec:
    automountServiceAccountToken: true
    containers:
      - args:
          - dagster
          - api
          - execute_run
        image: myaccountidhere.dkr.ecr.eu-west-1.amazonaws.com/dagster-repository-core:1.0.1509294
        imagePullPolicy: IfNotPresent
        name: dagster
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-ls9cl
            readOnly: true
          - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
            name: aws-iam-token
            readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeSelector:
      karpenter.sh/capacity-type: on-demand
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: dagster-jobs-core
    serviceAccountName: dagster-jobs-core
    terminationGracePeriodSeconds: 30
    tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300

This Pod was evicted very shortly after being scheduled:

Pod Condition Transitions

- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:10.0000000Z
  status: True
  type: Initialized
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:10.0000000Z
  status: True
  type: PodScheduled
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:16.0000000Z
  message: 'Eviction API: evicting'
  reason: EvictionByEvictionAPI
  status: True
  type: DisruptionTarget
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  status: False
  type: PodReadyToStartContainers
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  reason: PodFailed
  status: False
  type: Ready
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  reason: PodFailed
  status: False
  type: ContainersReady

And based on the following audit log it does seem clear to me that Karpenter initiated this eviction:

Eviction Audit Event

kind: Event
apiVersion: audit.k8s.io/v1
level: RequestResponse
auditID: b93e7d92-3fa7-4a11-99ec-4789388d5d5e
stage: ResponseComplete
requestURI: /api/v1/namespaces/dagster/pods/dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7/eviction
verb: create
user:
  username: system:serviceaccount:karpenter:karpenter
  uid: 60b39c75-c897-4ee4-8b35-23a191b89833
  groups:
    - system:serviceaccounts
    - system:serviceaccounts:karpenter
    - system:authenticated
  extra:
    authentication.kubernetes.io/pod-name:
      - karpenter-c4cff56cf-nj7fs
    authentication.kubernetes.io/pod-uid:
      - 6f68a05a-e24d-4c89-9b15-08fc77d04034
sourceIPs:
  - # snip
userAgent: karpenter/v0.34.0
objectRef:
  resource: pods
  namespace: dagster
  name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
  apiVersion: v1
  subresource: eviction
responseStatus:
  metadata: {}
  status: Success
  code: 201
requestObject:
  kind: Eviction
  apiVersion: policy/v1
  metadata:
    name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
    namespace: dagster
    creationTimestamp: 
responseObject:
  kind: Status
  apiVersion: v1
  metadata: {}
  status: Success
  code: 201
requestReceivedTimestamp: 2024-04-05T17:13:16.8145040Z
stageTimestamp: 2024-04-05T17:13:16.8388540Z
annotations:
  authorization.k8s.io/decision: allow
  authorization.k8s.io/reason: 'RBAC: allowed by ClusterRoleBinding "karpenter-core" of ClusterRole "karpenter-core" to ServiceAccount "karpenter/karpenter"'

Some extra details:

  • I'm using karpenter.sh/do-not-evict but that annotation should still work, right?
  • I'm using Karpenter v0.34.0
  • I'm using k8s 1.29 on AWS EKS
  • I'm fairly confident this is new behaviour. We recently updated from v0.32.1 -> v0.34.0, which is possibly when it started occurring
  • The (very small) sample of failed Pods I've been investigating were all evicted right after being scheduled

@jmdeal
Member

jmdeal commented Apr 8, 2024

I think what you've described lines up with the race I described, though this should have existed on v0.32.x as well. I'm speculating that we're seeing it appear more frequently on v0.34+ since that release introduced parallel disruption: by increasing the number of consolidation decisions made in a given period of time, we've also increased the chance of this race occurring. It's also going to depend on the scale of the cluster, since longer scheduling simulation times widen the window for this to occur. We are prioritizing a fix here; I'm hoping to get a PR out in the next couple of days.

@chenfeilee
Author

Hi @jmdeal, sorry for the delayed response. I had reached out to our AWS SA/TAM in our organization about the same issue and had also filed an AWS support case. I believe they have already been in touch with you with some additional k8s logs/events, which I collated into a document as well (sorry for not being able to share the document here, as I have not sanitized the information in it).

Looking forward to the release of the fix 🙇

@TimoVink

Thanks @jmdeal for looking into this, appreciate the quick turnaround on that PR!

Wondering if you have any suggestions for an interim workaround.

I assume the race condition applies equally regardless of the reason for consolidation (i.e., it applies to both WhenUnderutilized and WhenEmpty?). Does it also apply to nodes being removed due to "expiration"?

I was thinking perhaps I could set consolidationPolicy: Never and expireAfter: 24h or something similar to avoid my Jobs failing intermittently, while still ensuring the cluster is generally right-sized (though not as efficiently as it would be using consolidation).
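
Roughly the shape of what I have in mind, as a sketch against the v1beta1 NodePool API (I haven't verified every field/value against the v0.34 docs, I'm not sure consolidation can be turned off outright, and all names are placeholders):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: jobs                            # placeholder NodePool name
spec:
  disruption:
    consolidationPolicy: WhenEmpty      # only consider empty nodes for consolidation
    consolidateAfter: 30s               # wait before reclaiming an empty node
    expireAfter: 24h                    # still recycle nodes daily
  template:
    spec:
      nodeClassRef:
        name: default                   # placeholder EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]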
