Keda scaled job pods are prematurely terminated by Karpenter #6337

Closed
vinayak-shanawad opened this issue Jun 8, 2024 · 2 comments
Labels: bug (Something isn't working), question (Issues that are support related questions)

Comments

vinayak-shanawad commented Jun 8, 2024

Description

Observed Behavior:
We are running generative AI workloads as Keda-scaled jobs, and we noticed that the job pods are prematurely terminated by Karpenter after 14 minutes.

For example, I placed a message in an SQS queue, which triggered a Keda job to start a pod. This pod ran for 14 minutes before being terminated.

To keep Karpenter from disrupting the GPU nodes, we set the following annotation on both the ScaledJob and its pod template:

annotations:
    karpenter.sh/do-not-disrupt: "true"

Karpenter logs:

{"level":"INFO","time":"2024-06-07T14:04:33.106Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"3ec00e6","nodepool":"gpu","nodeclaim":"gpu-86bv5","requests":{"cpu":"190m","memory":"248Mi","nvidia.com/gpu":"1","pods":"9"},"instance-types":"g4dn.xlarge"}
{"level":"INFO","time":"2024-06-07T14:04:34.857Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2","instance-type":"g4dn.xlarge","zone":"us-east-2c","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"359Gi","memory":"14481Mi","nvidia.com/gpu":"1","pods":"29","vpc.amazonaws.com/pod-eni":"39"}}
{"level":"INFO","time":"2024-06-07T14:05:26.458Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:08:59.234Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 candidates ip-10-25-66-161.us-east-2.compute.internal/g4dn.xlarge/on-demand","commit":"3ec00e6","command-id":"517ba768-1cac-41bd-9870-867f3249f34c"}
{"level":"INFO","time":"2024-06-07T14:08:59.601Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"3ec00e6","command-id":"517ba768-1cac-41bd-9870-867f3249f34c"}
{"level":"INFO","time":"2024-06-07T14:08:59.664Z","logger":"controller.node.termination","message":"tainted node","commit":"3ec00e6","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:09:02.603Z","logger":"controller.node.termination","message":"deleted node","commit":"3ec00e6","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:09:03.025Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","node":"ip-10-25-66-161.us-east-2.compute.internal","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2"}
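
When reading logs like these, it can help to compute how long the node existed before Karpenter chose to consolidate it. The following is a minimal sketch, not part of the original report: it embeds two entries from the excerpt, trimmed to the `time` and `message` fields, and the helper names (`parse_time`, `first_time`, `node_lifetime_seconds`) are hypothetical.

```python
import json
from datetime import datetime

# Two entries from the Karpenter log excerpt, trimmed to the fields used here.
LOG_LINES = [
    '{"time":"2024-06-07T14:04:33.106Z","message":"created nodeclaim","nodeclaim":"gpu-86bv5"}',
    '{"time":"2024-06-07T14:08:59.234Z","message":"disrupting via consolidation delete, '
    'terminating 1 candidates ip-10-25-66-161.us-east-2.compute.internal/g4dn.xlarge/on-demand"}',
]


def parse_time(ts):
    """Parse the UTC timestamp format used in the log excerpt."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def first_time(lines, message_prefix):
    """Return the timestamp of the first entry whose message starts with message_prefix."""
    for line in lines:
        entry = json.loads(line)
        if entry["message"].startswith(message_prefix):
            return parse_time(entry["time"])
    raise ValueError(f"no log entry starting with {message_prefix!r}")


def node_lifetime_seconds(lines):
    """Seconds between nodeclaim creation and the consolidation decision."""
    created = first_time(lines, "created nodeclaim")
    disrupted = first_time(lines, "disrupting via consolidation")
    return (disrupted - created).total_seconds()


print(node_lifetime_seconds(LOG_LINES))
```

For this excerpt the gap is roughly four and a half minutes. It measures the node's lifetime as seen by Karpenter, which is not necessarily the same as the pod runtime reported in the description.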

Expected Behavior:
Keda-scaled job pods should run to completion without being terminated by Karpenter.

Reproduction Steps (Please include YAML):

Keda scaled job manifest file details.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: video-summarization
  namespace: gen-ai
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::1234:role/gen-ai-video-summarization-pod
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: video-summarization-job
  namespace: gen-ai
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  jobTargetRef:
    parallelism: 1
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-disrupt: "true"
      spec:
        serviceAccountName: video-summarization
        tolerations:
          - key: "accrete.ai/compute-type"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
        nodeSelector:
          accrete.ai/compute-type: gpu
        hostPID: true
        containers:
        - name: video-summarization-job
          image: <image-details>
          imagePullPolicy: Always
          command: ["python", "main.py"]
          env:
            - name: REQUEST_QUEUE_URL
              value: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-input-queue"
            - name: RESPONSE_QUEUE_URL
              value: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-output-queue"
            - name: AWS_REGION
              value: "us-east-2"
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
        restartPolicy: Never
  pollingInterval: 60
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-input-queue"
        queueLength: "1"
        awsRegion: us-east-2
      authenticationRef:
        name: video-summarization-keda-job-triggerauth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: video-summarization-keda-job-triggerauth
  namespace: gen-ai
spec:
  podIdentity:
    provider: aws
    identityOwner: workload
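
Karpenter documents `karpenter.sh/do-not-disrupt` as an annotation it honors on pods, so the copy that matters in this manifest is the one under `spec.jobTargetRef.template.metadata.annotations`. A minimal sketch of a check for that, assuming the manifest has already been loaded into a plain Python dict (for example by a YAML parser); the function name `pod_template_opts_out` is hypothetical:

```python
DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


def pod_template_opts_out(scaled_job):
    """Return True if the ScaledJob's pod template carries do-not-disrupt: "true".

    Only the pod template is inspected; an annotation on the ScaledJob's own
    metadata is deliberately ignored by this check.
    """
    template = scaled_job.get("spec", {}).get("jobTargetRef", {}).get("template", {})
    annotations = (template.get("metadata") or {}).get("annotations") or {}
    # Annotation values are strings, so compare against "true", not the boolean True.
    return annotations.get(DO_NOT_DISRUPT) == "true"


# Trimmed version of the ScaledJob from this issue.
scaled_job = {
    "kind": "ScaledJob",
    "metadata": {"annotations": {DO_NOT_DISRUPT: "true"}},
    "spec": {
        "jobTargetRef": {
            "template": {"metadata": {"annotations": {DO_NOT_DISRUPT: "true"}}}
        }
    },
}
print(pod_template_opts_out(scaled_job))
```

The same check can be run against the pods that actually get created, since that is where a missing or mis-indented annotation would show up.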

Versions:

  • Kubernetes Version (kubectl version):
    Client Version: v1.29.0
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.29.4-eks-036c24b

  • Karpenter Version: 0.34.5

  • Keda Version: ghcr.io/kedacore/keda:2.13.0

vinayak-shanawad added the bug and needs-triage labels on Jun 8, 2024
Contributor

tzneal commented Jun 10, 2024

This is kubernetes-sigs/karpenter#1167 and should be resolved in v0.36.2 or v0.37.0. Can you upgrade and confirm that the issue no longer appears?
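
To compare an installed version against the releases named here, a small sketch follows. It assumes plain `MAJOR.MINOR.PATCH` version strings with an optional leading `v` and no pre-release suffix, and it assumes every release from v0.36.2 onward carries the fix; the helper names are hypothetical.

```python
def parse_version(version):
    """Turn 'v0.36.2' or '0.36.2' into a comparable tuple such as (0, 36, 2)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# Earliest release named in this thread as containing the fix.
FIRST_FIXED = parse_version("v0.36.2")


def has_fix(installed):
    """Return True if the installed version is at or beyond the first fixed release."""
    return parse_version(installed) >= FIRST_FIXED


for candidate in ("0.34.5", "v0.36.2", "v0.37.0"):
    print(candidate, has_fix(candidate))
```

With the 0.34.5 version reported in this issue, the check returns False, which is consistent with the suggestion to upgrade.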

tzneal added the question label and removed the needs-triage label on Jun 10, 2024
@vinayak-shanawad
Author

Yeah, this issue turned out to be related to Keda, not Karpenter.
