Apply scaledobject terminates all running jobs #1021

Closed
audunsol opened this issue Aug 20, 2020 · 4 comments
Labels: bug (Something isn't working)

Comments

@audunsol

Applying an update (typically a new container image tag from our CI/CD pipeline) to a ScaledObject with scaleType: job terminates all running jobs.

This does not fit well with the run-to-completion nature of jobs, and we have to make sure that deploying new code does not interrupt our long-running simulations (which is the main reason we chose jobs over deployments).

Expected Behavior

  • Already started jobs run to completion with the configuration they had when they were started.
  • New jobs (e.g. triggered by new incoming queue messages) run with the new configuration.

Actual Behavior

Already running jobs and associated pods are terminated and deleted.

Steps to Reproduce the Problem

  1. Define a long-running, queue-triggered ScaledObject with scaleType: job:
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  scaleType: job
  pollingInterval: 10   # Optional. Default: 30 seconds
  maxReplicaCount: 15  # Optional. Default: 100
  minReplicaCount: 0   # Optional. Default: 0
  cooldownPeriod:  30  # Optional. Default: 300 seconds
  jobTargetRef:
    parallelism: 1 # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1 # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 900 # Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6 # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      # describes the [job template](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/)
      metadata:
        labels:
          jobgroup: somejobgroupthing
      spec:
        containers:
          - name: busybox-looping
            image: busybox
            command: ['sh', '-c', 'x=1;while [ $x -le 100 ]; do let y=x*2; let z=x*3; let a=x*4; echo $x $y $z $a ; sleep 1; let x=x+1;done']
            env:
              - name: THE_QUEUE
                value: mytestqueuethatijustaddamessageto
              - name: STORAGE_ACCOUNT_CONNECTION_STRING
                valueFrom:
                  secretKeyRef:
                    name: my-secrets
                    key: STORAGE_ACCOUNT_CONNECTION_STRING
        restartPolicy: Never
  triggers:
    - type: azure-queue
      metadata:
        queueName: mytestqueuethatijustaddamessageto
        queueLength: '20' # Optional. Queue length target for HPA. Default: 5 messages
        connection: STORAGE_ACCOUNT_CONNECTION_STRING
  2. Save the file and apply it to the cluster with kubectl apply -f my-busybox-job-test.yaml.
  3. Push a message to the queue (see the command sketch after this list).
  4. Observe pods being created and starting their calculations.
  5. Make a simple update to the YAML, e.g. change spec.jobTargetRef.template.spec.containers.image or the command, and apply it again.
  6. Observe that the running jobs/pods are terminated.
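A minimal sketch of the commands behind steps 2 and 3, assuming the Azure CLI is installed and the my-secrets secret does not exist yet (the connection string is a placeholder):

# create the secret the manifest references (placeholder value)
kubectl create secret generic my-secrets \
  --from-literal=STORAGE_ACCOUNT_CONNECTION_STRING='<your-storage-connection-string>'

# apply the ScaledObject
kubectl apply -f my-busybox-job-test.yaml

# push a test message onto the queue
az storage message put \
  --queue-name mytestqueuethatijustaddamessageto \
  --content 'hello' \
  --connection-string '<your-storage-connection-string>'

# watch the job pods come up
kubectl get pods -l jobgroup=somejobgroupthing -w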

Specifications

  • KEDA Version: 1.4.1
  • Platform & Version: Azure AKS
  • Kubernetes Version: v1.16.9
  • Scaler(s): job
@audunsol added the bug label Aug 20, 2020
@nrjohnstone

We are seeing this as well with our long-running jobs, and it does not play nicely with the continuous-delivery nature of the code bases whose containers are scaled by KEDA.

The alternative, of course, is to ensure that all of your batch jobs run via KEDA use some kind of saga pattern: if they are driven off a queue with a visibility window, an interrupted job will be kicked off again and can resume close to where it left off. However, this depends on the nature of the work being done and is not always possible.
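For illustration, a minimal sketch of that idea applied to the busybox loop from the manifest above, assuming a persistent volume is mounted at /work (a hypothetical path) and the queue message is only deleted once the work completes:

# resumable variant of the busybox loop: persist progress so a re-delivered
# queue message can pick up roughly where a terminated pod left off
CHECKPOINT=/work/checkpoint          # assumes a mounted persistent volume
x=$(cat "$CHECKPOINT" 2>/dev/null || echo 1)
while [ "$x" -le 100 ]; do
  echo "processing step $x"
  echo "$x" > "$CHECKPOINT"          # record progress after each unit of work
  sleep 1
  x=$((x + 1))
done
# only delete/acknowledge the queue message here; if the pod was killed earlier,
# the visibility timeout expires and the message becomes visible again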

@tomkerkhove
Member

@TsuyoshiUshio Is this behavior the same with 2.0?

@audunsol
Author

I have upgraded to keda-2.0.0-beta on our test cluster now, and as far as I can see, this issue seems to be fixed there. Thanks!

I am happy to close this issue then, unless you would like to address it somehow for 1.x as well (the behavior itself and/or its documentation).
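For reference, in 2.0 job scaling moved to a dedicated ScaledJob resource under the keda.sh/v1alpha1 API group; a rough, untested sketch of what the manifest above might look like there (field names such as connectionFromEnv should be checked against the 2.0 docs):

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-long-running-scaled-job
  namespace: default
spec:
  pollingInterval: 10
  maxReplicaCount: 15
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 900
    backoffLimit: 6
    template:
      # container spec (image, command, env/secret references) as in the 1.x manifest above
      spec:
        containers:
          - name: busybox-looping
            image: busybox
        restartPolicy: Never
  triggers:
    - type: azure-queue
      metadata:
        queueName: mytestqueuethatijustaddamessageto
        queueLength: '20'
        connectionFromEnv: STORAGE_ACCOUNT_CONNECTION_STRING  # assumed 2.0 field name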

@tomkerkhove
Member

Let's close this then indeed; we don't have concrete plans to ship a new 1.x version.
