Apply scaledobject terminates all running jobs #1021

Closed
audunsol opened this issue Aug 20, 2020 · 4 comments
Labels: bug (Something isn't working)

Comments

@audunsol

Applying an update (typically a new container image tag from our CI/CD pipeline) to a ScaledObject with scaleType: job terminates all running jobs.

This does not fit well with the run-to-completion nature of jobs, and we have to make sure that deploying new code does not interrupt our long-running simulations (which is the main reason we chose jobs over deployments).

Expected Behavior

  • Already started jobs run to completion with the configuration they had when they were started.
  • New jobs (e.g. triggered by new incoming queue messages) run with the new configuration.

Actual Behavior

Already running jobs and associated pods are terminated and deleted.

Steps to Reproduce the Problem

  1. Define a long-running, queue-triggered ScaledObject with scaleType: job:
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  scaleType: job
  pollingInterval: 10   # Optional. Default: 30 seconds
  maxReplicaCount: 15  # Optional. Default: 100
  minReplicaCount: 0   # Optional. Default: 0
  cooldownPeriod:  30  # Optional. Default: 300 seconds
  jobTargetRef:
    parallelism: 1 # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1 # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 900 # Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6 # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      # describes the [job template](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/)
      metadata:
        labels:
          jobgroup: somejobgroupthing
      spec:
        containers:
          - name: busybox-looping
            image: busybox
            command: ['sh', '-c', 'x=1;while [ $x -le 100 ]; do let y=x*2; let z=x*3; let a=x*4; echo $x $y $z $a ; sleep 1; let x=x+1;done']
            env:
              - name: THE_QUEUE
                value: mytestqueuethatijustaddamessageto
              - name: STORAGE_ACCOUNT_CONNECTION_STRING
                valueFrom:
                  secretKeyRef:
                    name: my-secrets
                    key: STORAGE_ACCOUNT_CONNECTION_STRING
        restartPolicy: Never
  triggers:
    - type: azure-queue
      metadata:
        queueName: mytestqueuethatijustaddamessageto
        queueLength: '20' # Optional. Queue length target for HPA. Default: 5 messages
        connection: STORAGE_ACCOUNT_CONNECTION_STRING
  2. Save the file and apply it to the cluster with kubectl apply -f my-busybox-job-test.yaml.
  3. Push a message to the queue (see the command sketch after this list).
  4. Observe pods being created and starting their calculations.
  5. Make a simple update to the YAML, e.g. change spec.jobTargetRef.template.spec.containers.image or the command, and apply it again.
  6. Observe that the running jobs/pods are terminated.
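A minimal sketch of the commands behind steps 2 and 3, assuming the Azure CLI is installed and the my-secrets secret does not exist yet (the connection string is a placeholder):

# create the secret the manifest references (placeholder value)
kubectl create secret generic my-secrets \
  --from-literal=STORAGE_ACCOUNT_CONNECTION_STRING='<your-storage-connection-string>'

# apply the ScaledObject
kubectl apply -f my-busybox-job-test.yaml

# push a test message onto the queue
az storage message put \
  --queue-name mytestqueuethatijustaddamessageto \
  --content 'hello' \
  --connection-string '<your-storage-connection-string>'

# watch the job pods come up
kubectl get pods -l jobgroup=somejobgroupthing -w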

Specifications

  • KEDA Version: 1.4.1
  • Platform & Version: Azure AKS
  • Kubernetes Version: v1.16.9
  • Scaler(s): job
@audunsol added the bug label Aug 20, 2020
@nrjohnstone

We are seeing this as well with our long-running jobs, and it does not play nicely with the continuous-delivery nature of the code bases whose containers are scaled by KEDA.

The alternative, of course, is to ensure that all of your batch jobs run via KEDA use some kind of saga pattern: if they are driven off a queue with a visibility window, an interrupted job will be kicked off again and can resume close to where it left off. However, this depends on the nature of the work being done and is not always possible.
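For illustration, a minimal sketch of that idea applied to the busybox loop from the manifest above, assuming a persistent volume is mounted at /work (a hypothetical path) and the queue message is only deleted once the work completes:

# resumable variant of the busybox loop: persist progress so a re-delivered
# queue message can pick up roughly where a terminated pod left off
CHECKPOINT=/work/checkpoint          # assumes a mounted persistent volume
x=$(cat "$CHECKPOINT" 2>/dev/null || echo 1)
while [ "$x" -le 100 ]; do
  echo "processing step $x"
  echo "$x" > "$CHECKPOINT"          # record progress after each unit of work
  sleep 1
  x=$((x + 1))
done
# only delete/acknowledge the queue message here; if the pod was killed earlier,
# the visibility timeout expires and the message becomes visible again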

@tomkerkhove
Member

@TsuyoshiUshio Is this behavior the same with 2.0?

@audunsol
Author

I have upgraded to keda-2.0.0-beta on our test cluster now, and as far as I can see, this issue seems to be fixed there. Thanks!

I am happy to close this issue then, unless you would like to address it somehow for 1.x as well (the behavior itself and/or its documentation).
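For reference, in 2.0 job scaling moved to a dedicated ScaledJob resource under the keda.sh/v1alpha1 API group; a rough, untested sketch of what the manifest above might look like there (field names such as connectionFromEnv should be checked against the 2.0 docs):

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  pollingInterval: 10
  maxReplicaCount: 15
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 900
    backoffLimit: 6
    template:
      # container spec (image, command, env/secret references) as in the 1.x manifest above
      spec:
        containers:
          - name: busybox-looping
            image: busybox
        restartPolicy: Never
  triggers:
    - type: azure-queue
      metadata:
        queueName: mytestqueuethatijustaddamessageto
        queueLength: '20'
        connectionFromEnv: STORAGE_ACCOUNT_CONNECTION_STRING  # assumed 2.0 field name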

@tomkerkhove
Member

Let's close this then indeed; we don't have concrete plans to ship a new 1.x version.
