Introduce new scaling logic with fix orphan pod issue #1214

TsuyoshiUshio · 2020-10-03T01:13:24Z

This PR changes the scaling logic for ScaledJob.
It should solve the issues listed below. I'd like to share this PR for review first.

Old Logic

The number of newly created jobs is queueLength - runningCount.

New Logic

The number of newly created jobs is effectiveMaxScale, computed as follows:

    if (queueLength + runningJobCount) > scaledJob.MaxReplicaCount() {
        effectiveMaxScale = scaledJob.MaxReplicaCount() - runningJobCount
    } else {
        effectiveMaxScale = queueLength
    }

Limitation

ServiceBusScaler uses *queueEntity.CountDetails.ActiveMessageCount to fetch the active message count. However, this value is not accurate as a queue length, because it includes messages that are locked. That is, if you receive a message from a queue and do not complete it, the message is locked and other clients can't consume it, yet ActiveMessageCount still counts it. I looked for a way to fetch ActiveMessageCount - LockedMessageCount, but I have not found one so far.

What I did

Introduce new Scaled Job Logic
Fix the orphan pod issue

Checklist

Commits are signed with Developer Certificate of Origin (DCO)
Tests have been added
A PR is opened to update the documentation on https://github.com/kedacore/keda-docs
Changelog has been updated

Fixes #
#1207 (comment)
#1186
#1211

Signed-off-by: Tsuyoshi Ushio <[email protected]>

zroubalik

LGTM thanks!

dron-alterpost · 2020-10-06T15:44:27Z

Thanks for the fix! When will it be in the Helm chart 2.0-RC?

* Introduce new scaling logic with fix orphan pod issue
  Signed-off-by: Tsuyoshi Ushio <[email protected]>
* update yamls
  Signed-off-by: Tsuyoshi Ushio <[email protected]>
* Remove to fit the coding style
  Signed-off-by: Tsuyoshi Ushio <[email protected]>

MoKassem · 2022-12-14T15:24:36Z

@TsuyoshiUshio
What scalingStrategy should I use to get this behaviour? I'm using "accurate" and still facing the same issue: when a job is currently running and a new message is received in the queue, it doesn't scale up a new job.

Actually, I tried all the scaling strategies and still can't get a new job created when a long-running job is executing and a new RabbitMQ message is received. My spec:

spec:
  jobTargetRef:
    parallelism: 1                            
    completions: 1
    activeDeadlineSeconds: 21600
    template:
      spec:
        tolerations:
        - key: "node_pool"
          operator: "Equal"
          value: "routing_small"
          effect: "NoSchedule"
        containers:
        - name: axl-routing-sm
          image: us-west1-docker.pkg.dev/axlehire-prod/axl-dcr/axl-routing-controller:0.5.52-kubernetes
          imagePullPolicy: Always
          resources:
            requests:
              memory: "4Gi"
              cpu: 8
            limits:
              cpu: 16
              memory: "8Gi"
        restartPolicy: Never
    backoffLimit: 0  
  pollingInterval: 5                    # Optional. Default: 30 seconds
  minReplicaCount: 0
  maxReplicaCount: 100                  # Optional. Default: 100
  successfulJobsHistoryLimit: 100       # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 100           # Optional. Default: 100. How many failed jobs should be kept.
  scalingStrategy:
    strategy: "accurate"                # Optional. Default: default. Which Scaling Strategy to use. 
    pendingPodConditions:               # Optional. A parameter to calculate pending job count per the specified pod conditions
    - "Pending"
    - "ContainerCreating"
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: routing-kubernetes
      mode: QueueLength
      value: "1"
    authenticationRef:
      name: keda-trigger-auth-axl-rabbitmq

TsuyoshiUshio requested review from ahmelsayed and zroubalik as code owners October 3, 2020 01:13

TsuyoshiUshio added 2 commits October 3, 2020 01:29

Introduce new scaling logic with fix orphan pod issue

ceea0f7

Signed-off-by: Tsuyoshi Ushio <[email protected]>

update yamls

6b5c0a4

Signed-off-by: Tsuyoshi Ushio <[email protected]>

TsuyoshiUshio force-pushed the tsushi/fixscalelogic branch from 5b71519 to 6b5c0a4 Compare October 3, 2020 08:31

Remove to fit the coding style

e7c1e3b

Signed-off-by: Tsuyoshi Ushio <[email protected]>

tomkerkhove assigned ahmelsayed and zroubalik Oct 4, 2020

zroubalik approved these changes Oct 5, 2020

zroubalik merged commit 8d31493 into v2 Oct 5, 2020

zroubalik deleted the tsushi/fixscalelogic branch October 5, 2020 08:59

audunsol mentioned this pull request Oct 5, 2020

Calculation of number of scale does not consider number of running jobs properly #1222

Closed

tomkerkhove mentioned this pull request Oct 12, 2020

Pod of ScaledJob with Rabbitmq as Scaler lives during 12 minutes after got status "completed" #1211

Closed
