Keda scaled job pods are prematurely terminated by Karpenter #6337

vinayak-shanawad · 2024-06-08T08:16:46Z

Description

Observed Behavior:
We run generative AI workloads as Keda-scaled jobs, and noticed that the job pods are prematurely terminated by Karpenter after 14 minutes.

For example, I placed a message in an SQS queue, which triggered a Keda job to start a pod. This pod ran for 14 minutes before being terminated.

We have set the following annotation on both the ScaledJob and its pod template so that Karpenter does not disrupt the GPU nodes.

annotations:
    karpenter.sh/do-not-disrupt: "true"

Karpenter logs:

{"level":"INFO","time":"2024-06-07T14:04:33.106Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"3ec00e6","nodepool":"gpu","nodeclaim":"gpu-86bv5","requests":{"cpu":"190m","memory":"248Mi","nvidia.com/gpu":"1","pods":"9"},"instance-types":"g4dn.xlarge"}
{"level":"INFO","time":"2024-06-07T14:04:34.857Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2","instance-type":"g4dn.xlarge","zone":"us-east-2c","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"359Gi","memory":"14481Mi","nvidia.com/gpu":"1","pods":"29","vpc.amazonaws.com/pod-eni":"39"}}
{"level":"INFO","time":"2024-06-07T14:05:26.458Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:08:59.234Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 candidates ip-10-25-66-161.us-east-2.compute.internal/g4dn.xlarge/on-demand","commit":"3ec00e6","command-id":"517ba768-1cac-41bd-9870-867f3249f34c"}
{"level":"INFO","time":"2024-06-07T14:08:59.601Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"3ec00e6","command-id":"517ba768-1cac-41bd-9870-867f3249f34c"}
{"level":"INFO","time":"2024-06-07T14:08:59.664Z","logger":"controller.node.termination","message":"tainted node","commit":"3ec00e6","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:09:02.603Z","logger":"controller.node.termination","message":"deleted node","commit":"3ec00e6","node":"ip-10-25-66-161.us-east-2.compute.internal"}
{"level":"INFO","time":"2024-06-07T14:09:03.025Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"3ec00e6","nodeclaim":"gpu-86bv5","node":"ip-10-25-66-161.us-east-2.compute.internal","provider-id":"aws:///us-east-2c/i-02a685532f5eba4e2"}
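
To read the timeline off these logs, a small sketch like the following may help. It parses the JSON log lines and measures the gap between two events. The log lines are copied from above but trimmed to the fields the sketch uses; the helper names (`parse_time`, `seconds_between`) are illustrative and not part of Karpenter.

```python
import json
from datetime import datetime

# Log lines from the issue, trimmed to the fields this sketch reads.
LOG_LINES = [
    '{"time":"2024-06-07T14:04:33.106Z","logger":"controller.provisioner","message":"created nodeclaim"}',
    '{"time":"2024-06-07T14:05:26.458Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim"}',
    '{"time":"2024-06-07T14:08:59.234Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 candidates ip-10-25-66-161.us-east-2.compute.internal/g4dn.xlarge/on-demand"}',
    '{"time":"2024-06-07T14:09:02.603Z","logger":"controller.node.termination","message":"deleted node"}',
]


def parse_time(stamp: str) -> datetime:
    """Parse a timestamp such as 2024-06-07T14:04:33.106Z (UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def seconds_between(lines, first_prefix: str, second_prefix: str) -> float:
    """Seconds from the first event whose message starts with first_prefix
    to the first event whose message starts with second_prefix."""
    events = [json.loads(line) for line in lines]

    def find(prefix: str) -> datetime:
        return next(
            parse_time(e["time"]) for e in events if e["message"].startswith(prefix)
        )

    return (find(second_prefix) - find(first_prefix)).total_seconds()


created_to_disrupt = seconds_between(
    LOG_LINES, "created nodeclaim", "disrupting via consolidation"
)
print(f"nodeclaim created -> consolidation: {created_to_disrupt:.3f} s")
```

On the lines above, the gap between "created nodeclaim" and the consolidation decision works out to roughly four and a half minutes, which is worth comparing against the pod runtime reported in the description.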

Expected Behavior:
Keda-scaled job pods should run to completion without being terminated by Karpenter.

Reproduction Steps (Please include YAML):

Keda ScaledJob manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: video-summarization
  namespace: gen-ai
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::1234:role/gen-ai-video-summarization-pod
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: video-summarization-job
  namespace: gen-ai
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  jobTargetRef:
    parallelism: 1
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-disrupt: "true"
      spec:
        serviceAccountName: video-summarization
        tolerations:
          - key: "accrete.ai/compute-type"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
        nodeSelector:
          accrete.ai/compute-type: gpu
        hostPID: true
        containers:
        - name: video-summarization-job
          image: <image-details>
          imagePullPolicy: Always
          command: ["python", "main.py"]
          env:
            - name: REQUEST_QUEUE_URL
              value: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-input-queue"
            - name: RESPONSE_QUEUE_URL
              value: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-output-queue"
            - name: AWS_REGION
              value: "us-east-2"
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
        restartPolicy: Never
  pollingInterval: 60
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: "https://sqs.us-east-2.amazonaws.com/1234/video-summarization-batch-job-input-queue"
        queueLength: "1"
        awsRegion: us-east-2
      authenticationRef:
        name: video-summarization-keda-job-triggerauth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: video-summarization-keda-job-triggerauth
  namespace: gen-ai
spec:
  podIdentity:
    provider: aws
    identityOwner: workload
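
As far as the Karpenter documentation describes it, the `karpenter.sh/do-not-disrupt` annotation is read from pods (and nodes), so the copy that matters here is the one under `spec.jobTargetRef.template.metadata.annotations`, not the one on the ScaledJob object itself. A minimal sketch of that check follows, assuming the ScaledJob has already been loaded into a Python dict (for example from `kubectl get scaledjob -o json` output). The function name `pod_template_opts_out` is hypothetical.

```python
ANNOTATION = "karpenter.sh/do-not-disrupt"


def pod_template_opts_out(scaled_job: dict) -> bool:
    """Return True if the pod template of a ScaledJob carries
    karpenter.sh/do-not-disrupt: "true"."""
    template = (
        scaled_job.get("spec", {}).get("jobTargetRef", {}).get("template", {})
    )
    annotations = template.get("metadata", {}).get("annotations") or {}
    return annotations.get(ANNOTATION) == "true"


# Annotation present on the pod template, as intended in the manifest above.
annotated = {
    "metadata": {"annotations": {ANNOTATION: "true"}},
    "spec": {
        "jobTargetRef": {
            "template": {"metadata": {"annotations": {ANNOTATION: "true"}}}
        }
    },
}

# Annotation only on the ScaledJob object; the pods would not inherit it.
top_level_only = {
    "metadata": {"annotations": {ANNOTATION: "true"}},
    "spec": {"jobTargetRef": {"template": {"spec": {}}}},
}

print(pod_template_opts_out(annotated))       # True
print(pod_template_opts_out(top_level_only))  # False
```

The same check can be pointed at a running pod's `metadata.annotations` to confirm that the annotation actually reached the pods the job created.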

Versions:

Kubernetes Version (kubectl version):
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b
Karpenter Version: 0.34.5
Keda Version: ghcr.io/kedacore/keda:2.13.0


tzneal · 2024-06-10T12:49:18Z

This is kubernetes-sigs/karpenter#1167 and should be resolved in v0.36.2 or v0.37.0. Can you upgrade and confirm that the issue no longer appears?
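
A quick way to check whether a given Karpenter version predates that fix is sketched below. It assumes plain `MAJOR.MINOR.PATCH` version strings (optionally prefixed with `v`) and treats every version from v0.36.2 onward as containing the fix, which is an interpretation of the comment above rather than something verified against the release notes.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v0.36.2' or '0.36.2' into (0, 36, 2)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# First release named in the comment above as containing the fix.
FIXED_IN = parse_version("v0.36.2")


def includes_fix(version: str) -> bool:
    """True if the version is at or after the first fixed release."""
    return parse_version(version) >= FIXED_IN


print(includes_fix("0.34.5"))   # version reported in this issue -> False
print(includes_fix("v0.37.0"))  # True
```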

vinayak-shanawad · 2024-06-27T10:45:39Z

Yes, this issue turned out to be related to Keda, not Karpenter.

vinayak-shanawad added the bug and needs-triage labels on Jun 8, 2024

tzneal added the question label and removed the needs-triage label on Jun 10, 2024

vinayak-shanawad closed this as completed on Jun 27, 2024
