
Pod still gets evicted even with "karpenter.sh/do-not-disrupt" annotation #1167

Closed
chenfeilee opened this issue Apr 4, 2024 · 6 comments · Fixed by #1180
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@chenfeilee

chenfeilee commented Apr 4, 2024

Description

Observed Behavior:
Pods with the "karpenter.sh/do-not-disrupt" annotation still get evicted. This seems to happen quite frequently, though not every time. Below is the number of such occurrences over the past 2 weeks:

[screenshot: occurrence counts over the past 2 weeks]

Some examples of Karpenter evicting pods just moments after reporting that the same pods could not be disrupted:

Apr 3, 2024 @ 16:10:41.000: Cannot disrupt Node: Pod "mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45" has "karpenter.sh/do-not-disrupt" annotation
Apr 3, 2024 @ 16:10:41.000: Cannot disrupt NodeClaim: Pod "mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45" has "karpenter.sh/do-not-disrupt" annotation
...
(Events from Datadog) Apr 3, 4:10:41 pm: Events from the Pod mesh-west/runner-zjioycxf-project-14388-concurrent-0lvg45 - Evicted pod - Events emitted by the karpenter
Apr 2, 2024 @ 13:09:04.000: Cannot disrupt Node: Pod "mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7" has "karpenter.sh/do-not-disrupt" annotation
Apr 2, 2024 @ 13:09:04.000: Cannot disrupt NodeClaim: Pod "mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7" has "karpenter.sh/do-not-disrupt" annotation
...
(Events from Datadog) Apr 2, 1:09:05 pm: Events from the Pod mesh-west/runner-su7-c7kj-project-14388-concurrent-0qc8j7 - Evicted pod - Events emitted by the karpenter

The implication is that our GitLab runner jobs get killed by these evictions, despite the runner pods having the "karpenter.sh/do-not-disrupt" annotation set, e.g.:
[screenshot]

Expected Behavior:
Pods with the "karpenter.sh/do-not-disrupt" annotation shouldn't be evicted.
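
For reference, this is roughly where the annotation sits on our runner pods (a minimal sketch with placeholder names, not the actual runner manifest):

apiVersion: v1
kind: Pod
metadata:
  name: example-runner                      # placeholder name
  annotations:
    karpenter.sh/do-not-disrupt: "true"     # annotation value must be a quoted string
spec:
  containers:
    - name: runner                          # placeholder container name
      image: example.registry/runner:latest # placeholder image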

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 0.35.2
  • Kubernetes Version (kubectl version): EKS v1.23.17-eks-508b6b3, v1.27.10-eks-508b6b3
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@chenfeilee chenfeilee added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2024
@njtran
Contributor

njtran commented Apr 5, 2024

/assign @jmdeal

@njtran njtran removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 5, 2024
@jmdeal
Member

jmdeal commented Apr 5, 2024

Are you able to share the full Karpenter logs and NodePool spec? It's possible that you're hitting the consolidation race condition called out in #651. This sequence seems likely:

  • Karpenter makes a consolidation decision and finishes its 15-second validation period. At this point, no do-not-disrupt pods are scheduled.
  • Karpenter checks for do-not-disrupt pods and begins the final validation simulation. This is the beginning of the race window.
  • do-not-disrupt pods are scheduled to the node; other parallel disruption runs emit the "Cannot disrupt Node" events.
  • The simulation finishes, the node is tainted, and Karpenter begins draining the node.

Are you able to check the scheduled vs evicted times for the pods so we can see if they line up with this theory?
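
For reference, the timestamps I have in mind are the pod's PodScheduled and DisruptionTarget conditions, something like the following (illustrative values only):

status:
  conditions:
    - type: PodScheduled                    # when the pod was bound to the node
      status: "True"
      lastTransitionTime: "2024-04-03T16:10:40Z"  # placeholder timestamp
    - type: DisruptionTarget                # set when the pod is evicted via the Eviction API
      status: "True"
      reason: EvictionByEvictionAPI
      lastTransitionTime: "2024-04-03T16:10:41Z"  # placeholder timestamp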

@TimoVink

TimoVink commented Apr 5, 2024

I may be running into the same thing. I'm not used to digging through k8s/Karpenter audit logs, but here is what I was able to dig up:

Pod Spec

NOTE: I manually removed some fields from this object for brevity and privacy. Let me know if I was overzealous in that respect.

  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      karpenter.sh/do-not-evict: true
    creationTimestamp: 2024-04-05T17:13:10.0000000Z
    finalizers:
      - batch.kubernetes.io/job-tracking
    generateName: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-
    labels:
      app.kubernetes.io/component: run_worker
      app.kubernetes.io/instance: dagster
      app.kubernetes.io/name: dagster
      app.kubernetes.io/part-of: dagster
      app.kubernetes.io/version: 1.6.11
      batch.kubernetes.io/controller-uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
      batch.kubernetes.io/job-name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
      controller-uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
      dagster/code-location: core
      dagster/job: ASSET_JOB_1
      dagster/run-id: 076c8f10-b1ec-4fcf-a81e-f7366fb60c94
      job-name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
    name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
    namespace: dagster
    ownerReferences:
      - apiVersion: batch/v1
        blockOwnerDeletion: true
        controller: true
        kind: Job
        name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94
        uid: 69e62c7e-7301-434d-8a2d-de1640a8659b
    resourceVersion: 299078016
    uid: 8cdd179a-c970-4600-87c6-46c51edf94da
  spec:
    automountServiceAccountToken: true
    containers:
      - args:
          - dagster
          - api
          - execute_run
        image: myaccountidhere.dkr.ecr.eu-west-1.amazonaws.com/dagster-repository-core:1.0.1509294
        imagePullPolicy: IfNotPresent
        name: dagster
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-ls9cl
            readOnly: true
          - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
            name: aws-iam-token
            readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeSelector:
      karpenter.sh/capacity-type: on-demand
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: dagster-jobs-core
    serviceAccountName: dagster-jobs-core
    terminationGracePeriodSeconds: 30
    tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300

This Pod was evicted very shortly after being scheduled:

Pod Condition Transitions

- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:10.0000000Z
  status: True
  type: Initialized
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:10.0000000Z
  status: True
  type: PodScheduled
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:16.0000000Z
  message: 'Eviction API: evicting'
  reason: EvictionByEvictionAPI
  status: True
  type: DisruptionTarget
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  status: False
  type: PodReadyToStartContainers
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  reason: PodFailed
  status: False
  type: Ready
- lastProbeTime: 
  lastTransitionTime: 2024-04-05T17:13:47.0000000Z
  reason: PodFailed
  status: False
  type: ContainersReady

And based on the following audit log it does seem clear to me that Karpenter initiated this eviction:

Eviction Audit Event

kind: Event
apiVersion: audit.k8s.io/v1
level: RequestResponse
auditID: b93e7d92-3fa7-4a11-99ec-4789388d5d5e
stage: ResponseComplete
requestURI: /api/v1/namespaces/dagster/pods/dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7/eviction
verb: create
user:
  username: system:serviceaccount:karpenter:karpenter
  uid: 60b39c75-c897-4ee4-8b35-23a191b89833
  groups:
    - system:serviceaccounts
    - system:serviceaccounts:karpenter
    - system:authenticated
  extra:
    authentication.kubernetes.io/pod-name:
      - karpenter-c4cff56cf-nj7fs
    authentication.kubernetes.io/pod-uid:
      - 6f68a05a-e24d-4c89-9b15-08fc77d04034
sourceIPs:
  - # snip
userAgent: karpenter/v0.34.0
objectRef:
  resource: pods
  namespace: dagster
  name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
  apiVersion: v1
  subresource: eviction
responseStatus:
  metadata: {}
  status: Success
  code: 201
requestObject:
  kind: Eviction
  apiVersion: policy/v1
  metadata:
    name: dagster-run-076c8f10-b1ec-4fcf-a81e-f7366fb60c94-tjmm7
    namespace: dagster
    creationTimestamp: 
responseObject:
  kind: Status
  apiVersion: v1
  metadata: {}
  status: Success
  code: 201
requestReceivedTimestamp: 2024-04-05T17:13:16.8145040Z
stageTimestamp: 2024-04-05T17:13:16.8388540Z
annotations:
  authorization.k8s.io/decision: allow
  authorization.k8s.io/reason: 'RBAC: allowed by ClusterRoleBinding "karpenter-core" of ClusterRole "karpenter-core" to ServiceAccount "karpenter/karpenter"'

Some extra details:

  • I'm using karpenter.sh/do-not-evict but that annotation should still work, right?
  • I'm using Karpenter v0.34.0
  • I'm using k8s 1.29 on AWS EKS
  • I'm fairly confident this is new behaviour. We recently updated from v0.32.1 -> v0.34.0, which is possibly when it started occurring
  • The (very small) sample of failed Pods I've been investigating were all evicted right after being scheduled

@jmdeal
Member

jmdeal commented Apr 8, 2024

I think what you've described lines up with the race I described, though this should have existed on v0.32.x as well. I'm speculating that we're seeing it appear more frequently on v0.34+ since that release introduced parallel disruption: by increasing the number of consolidation decisions made in a given period of time, we've also increased the chance of this race occurring. It's also going to depend on the scale of the cluster, since longer scheduling simulation times widen the window for this to occur. We are prioritizing a fix here; I'm hoping to get a PR out in the next couple of days.

@chenfeilee
Author

Hi @jmdeal, sorry for the delayed response. I had reached out to our AWS SA/TAM in our organization about the same issue and had also filed an AWS support case. I believe they have already been in touch with you with some additional k8s logs/events, which I collated into a document as well (sorry for not being able to share the document here, as I have not sanitized the information in it).

Looking forward to the release of the fix 🙇

@TimoVink

Thanks @jmdeal for looking into this, appreciate the quick turnaround on that PR!

Wondering if you have any suggestions for an interim workaround.

I assume the race condition applies equally regardless of the reason for consolidation (i.e., it applies to both WhenUnderutilized and WhenEmpty?). Does it also apply to nodes being removed due to "expiration"?

I was thinking perhaps I could set consolidationPolicy: Never and expireAfter: 24h or something similar to avoid my Jobs failing intermittently, while still ensuring the cluster is generally right-sized (though not as efficiently as it would be using consolidation).
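
Roughly the shape of what I have in mind, as a sketch against the v1beta1 NodePool API (I haven't verified every field/value against the v0.34 docs, I'm not sure consolidation can be turned off outright, and all names are placeholders):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: jobs                            # placeholder NodePool name
spec:
  disruption:
    consolidationPolicy: WhenEmpty      # only consider empty nodes for consolidation
    consolidateAfter: 30s               # wait before reclaiming an empty node
    expireAfter: 24h                    # still recycle nodes daily
  template:
    spec:
      nodeClassRef:
        name: default                   # placeholder EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]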
