scheduling with update from v1.2.0 to v1.4.0 #1775

regadas · 2021-10-08T14:36:23Z

What happened:

Volcano was updated from 1.2.0 to 1.4.0. With the new version, if there are not enough resources, PodGroups stay in the Pending phase and the cluster autoscaler is not triggered to provision more resources.

Did I miss something in the latest version?

What you expected to happen:

I was expecting it to work as in 1.2.0

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Volcano Version: 1.4.0
Kubernetes version (use kubectl version): 1.22.2
Cloud provider or hardware configuration: GKE
OS (e.g. from /etc/os-release): Container OS
Kernel (e.g. uname -a):
Install tools:
Others:

Thor-wl · 2021-10-19T01:31:25Z

/assign @Thor-wl

Thor-wl · 2021-11-01T01:31:49Z

Please give more details about your testing steps so that I can reproduce it. Thanks.

stale · 2022-01-30T03:50:20Z

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

yolgun · 2022-02-15T12:08:48Z

Hi @Thor-wl. I am able to reproduce this issue using a GKE Kubernetes cluster with autoscaling enabled. Creating a PodGroup that can't be satisfied with the current resources is enough. Prior to v1.4.0, scaleUp is triggered, which can be seen in the events. From v1.4.0 on, this event no longer occurs.

I'm not sure how to reproduce it locally, but we have investigated further on our side. It started after this PR. With it, Volcano began putting custom reasons such as Undetermined into a pod's status.conditions.reason field. The Kubernetes Cluster Autoscaler uses the same field to detect scale-up needs, but it only checks for Unschedulable. The related piece of code in the Cluster Autoscaler can be seen here.
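To make the mismatch concrete, here is a minimal Go sketch of the kind of check described above. It is not the Cluster Autoscaler's actual code: `PodCondition` is a simplified, hypothetical stand-in for the Kubernetes pod condition type, `triggersScaleUp` is an illustrative name, and the assumption that the check also looks at the `PodScheduled` condition type with status `False` goes beyond what this comment states.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a Kubernetes
// pod condition, reduced to the fields that matter for this issue.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

const (
	condPodScheduled    = "PodScheduled"
	statusFalse         = "False"
	reasonUnschedulable = "Unschedulable"
)

// triggersScaleUp mimics the check described in the comment: a pod is
// treated as a scale-up candidate only when its PodScheduled condition is
// False (assumption) and the reason is exactly "Unschedulable".
func triggersScaleUp(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == condPodScheduled && c.Status == statusFalse && c.Reason == reasonUnschedulable {
			return true
		}
	}
	return false
}

func main() {
	// Condition as written with the standard reason.
	standard := []PodCondition{{Type: condPodScheduled, Status: statusFalse, Reason: "Unschedulable"}}
	// Condition as written with a custom reason such as "Undetermined".
	custom := []PodCondition{{Type: condPodScheduled, Status: statusFalse, Reason: "Undetermined"}}

	fmt.Println("reason Unschedulable:", triggersScaleUp(standard))
	fmt.Println("reason Undetermined:", triggersScaleUp(custom))
}
```

Under these assumptions, a pod whose reason is any string other than `Unschedulable` is simply skipped by the check, which would match the missing scale-up events.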

I tested it by reverting the PR on top of v1.5.0-beta and autoscaling worked as before.

I'd appreciate any help on solving this in Volcano. Both autoscaling and batch scheduling are important to our setup.

fadi-artera · 2022-03-22T01:04:24Z

Hello, we are having the same issue and would appreciate an update.

Thor-wl · 2022-03-22T01:43:37Z

Thanks, guys. Let me take a look at that.

brickyard · 2022-05-06T22:21:03Z

Could we re-open and get an update here? Volcano is pretty much unusable with Cluster Autoscaler and Karpenter because of the "Undetermined" reason. Is there any reason why we shouldn't revert the PR to regain compatibility with the autoscaling/cloud ecosystem? Would love to hear from the team on this.

william-wang · 2022-05-07T01:12:49Z

@brickyard Of course, please update here. The scheduling-reason enhancement in PR #1672 probably did not consider the interaction between the scheduler and the autoscaler. @Thor-wl, please continue working on a fix. If there is no way to keep both autoscaling and the scheduling-reason enhancement, we need to revert first to preserve compatibility.

stale · 2022-08-10T03:18:49Z

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale · 2022-10-14T03:45:32Z

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

tgaddair · 2022-12-11T02:37:21Z

This issue is still affecting Karpenter users. Can we re-open and find a way to set the pod status to Unschedulable instead of "Undetermined"? Is there a reason it should be (or need to be) "Undetermined"?

tgaddair · 2022-12-11T03:32:33Z

I think there should be no issues with reverting #1672, as the intent was to provide more information to the user. But if it breaks compatibility with cluster autoscalers, that seems like a very steep price to pay for better logging. Maybe this PR could be re-submitted by just annotating the status message with this info, instead of changing the Unschedulable reason?
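As a rough illustration of that suggestion, the Go sketch below keeps the reason fixed at `Unschedulable` and moves the extra detail into the message. It is only a sketch: `PodCondition` is a simplified, hypothetical stand-in for the Kubernetes type, `unschedulableCondition` is an invented helper, and the example message text is made up. It does not show what any Volcano PR actually implements.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a Kubernetes
// pod condition, including the free-form Message field.
type PodCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// unschedulableCondition is a hypothetical helper. It always reports the
// standard "Unschedulable" reason, so autoscalers that match on that string
// still see the pod, and it prefixes the message with the scheduler's more
// specific detail instead of putting that detail in Reason.
func unschedulableCondition(detail, message string) PodCondition {
	return PodCondition{
		Type:    "PodScheduled",
		Status:  "False",
		Reason:  "Unschedulable",
		Message: fmt.Sprintf("%s: %s", detail, message),
	}
}

func main() {
	// The message text here is illustrative only.
	c := unschedulableCondition("Undetermined", "not enough resources for the pod group")
	fmt.Printf("reason=%s message=%q\n", c.Reason, c.Message)
}
```

The trade-off is that tools parsing the reason lose the finer-grained category, while users reading the message still get it.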

regadas added the kind/bug label Oct 8, 2021

volcano-sh-bot assigned Thor-wl Oct 19, 2021

stale bot added the lifecycle/stale label Jan 30, 2022

stale bot removed the lifecycle/stale label Feb 15, 2022

Thor-wl added the priority/important-soon label Mar 22, 2022

stale bot added the lifecycle/stale label Aug 10, 2022

stale bot closed this as completed Oct 14, 2022

tgaddair mentioned this issue Dec 12, 2022, in "Remove Undetermined reason to fix cluster autoscaler compatibility" #2602 (merged)

william-wang reopened this Dec 21, 2022

stale bot removed the lifecycle/stale label Dec 21, 2022

volcano-sh-bot closed this as completed in #2602 Dec 23, 2022
