scheduling with update from v1.2.0 to v1.4.0 #1775

Closed
regadas opened this issue Oct 8, 2021 · 12 comments · Fixed by #2602
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@regadas

regadas commented Oct 8, 2021

What happened:

We updated Volcano from 1.2.0 to 1.4.0. With the newer version, if there are not enough resources, PodGroups are kept in the Pending phase and the cluster autoscaler is not triggered to provision more resources.

Did I miss something in the latest version?

What you expected to happen:

I was expecting it to work as in 1.2.0

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: 1.4.0
  • Kubernetes version (use kubectl version): 1.22.2
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Container OS
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@regadas regadas added the kind/bug Categorizes issue or PR as related to a bug. label Oct 8, 2021
@Thor-wl
Contributor

Thor-wl commented Oct 19, 2021

/assign @Thor-wl

@Thor-wl
Contributor

Thor-wl commented Nov 1, 2021

Please give more details about your testing steps so that I can reproduce it. Thanks.

@stale

stale bot commented Jan 30, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2022
@yolgun

yolgun commented Feb 15, 2022

Hi @Thor-wl. I am able to reproduce this issue using a GKE Kubernetes cluster with autoscaling enabled. Creating a PodGroup that can't be satisfied with the current resources is enough. Prior to v1.4.0, a scale-up is triggered, which can be seen in the events. From v1.4.0 on, this event doesn't happen.

I'm not sure how to reproduce it locally, but we have investigated it further on our side. It started happening after this PR. Since that change, Volcano puts custom reasons such as Undetermined into the reason field of a pod's status.conditions. The Kubernetes Cluster Autoscaler uses the same field to detect scale-up needs, but it only checks for Unschedulable. The related piece of code in the Cluster Autoscaler can be seen here.

I tested it by reverting the PR on top of v1.5.0-beta and autoscaling worked as before.

I'd appreciate any help on solving this in Volcano. Both autoscaling and batch scheduling are important to our setup.
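
The mismatch described above can be illustrated with a minimal sketch. It is not the Cluster Autoscaler's actual code: `PodCondition` is a simplified stand-in for the Kubernetes pod condition, and `looksUnschedulable` is a hypothetical helper that only approximates the check described in this comment (a `PodScheduled` condition with status `False` and reason `Unschedulable`).

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for a Kubernetes pod condition.
// Only the fields relevant to this issue are kept.
type PodCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// looksUnschedulable is a hypothetical helper. It approximates the
// autoscaler-side check described in this thread: a pod counts as a
// scale-up candidate only if its PodScheduled condition is False and
// the reason is exactly "Unschedulable".
func looksUnschedulable(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == "PodScheduled" && c.Status == "False" && c.Reason == "Unschedulable" {
			return true
		}
	}
	return false
}

func main() {
	// Condition as a scheduler would report it with the standard reason.
	standard := []PodCondition{{Type: "PodScheduled", Status: "False", Reason: "Unschedulable"}}
	// Condition carrying the custom reason mentioned in this issue.
	custom := []PodCondition{{Type: "PodScheduled", Status: "False", Reason: "Undetermined"}}

	fmt.Println("standard reason detected:", looksUnschedulable(standard))
	fmt.Println("custom reason detected:", looksUnschedulable(custom))
}
```

Under these assumptions, a pod whose condition carries a custom reason is never picked up as a scale-up candidate, which matches the behavior reported here.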

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2022
@fadi-artera

Hello, we are having the same issue and would appreciate an update on it.

@Thor-wl
Contributor

Thor-wl commented Mar 22, 2022

Thanks, guys. Let me take a look at that.

@Thor-wl Thor-wl added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 22, 2022
@brickyard

Could we re-open this and get an update here? Volcano is pretty much unusable with Cluster Autoscaler and Karpenter because of the "Undetermined" reason. Is there any reason why we shouldn't revert the PR to regain compatibility with the autoscaling/cloud ecosystem? Would love to hear from the team on this.

@william-wang
Member

@brickyard Of course, please update here. Maybe the scheduling reason enhancement in pr#1672 failed to consider the interaction between the scheduler and the autoscaler. @Thor-wl please continue working on a fix for this. If there is no way to support both autoscaling and the scheduling reason enhancement, we need to revert first to keep compatibility.

@stale

stale bot commented Aug 10, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@stale

stale bot commented Oct 14, 2022

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Oct 14, 2022
@tgaddair
Contributor

This issue is still affecting Karpenter users. Can we re-open and find a way to set the pod status to Unschedulable instead of "Undetermined"? Is there a reason it should be (or need to be) "Undetermined"?

@tgaddair
Contributor

I think there should be no issues with reverting #1672, as the intent was to provide more information to the user. But if it breaks compatibility with cluster autoscalers, that seems like a very steep price to pay for better logging. Maybe this PR could be re-submitted by just annotating the status message with this info, instead of changing the Unschedulable reason?
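
A rough sketch of that idea, assuming the extra detail moves into the condition message while the reason stays `Unschedulable`. This is not Volcano's code: `PodCondition` is a simplified stand-in for the Kubernetes type, `newPendingCondition` is a hypothetical helper, and the message text is made up for illustration.

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for a Kubernetes pod condition.
type PodCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// newPendingCondition is a hypothetical helper. It keeps the standard
// "Unschedulable" reason so that autoscalers matching on it still react,
// and carries the finer-grained scheduler detail in the message instead.
func newPendingCondition(detailReason, detail string) PodCondition {
	return PodCondition{
		Type:    "PodScheduled",
		Status:  "False",
		Reason:  "Unschedulable",
		Message: fmt.Sprintf("%s: %s", detailReason, detail),
	}
}

func main() {
	// The detail string below is illustrative only.
	c := newPendingCondition("Undetermined", "not enough resources for the whole PodGroup")
	fmt.Printf("reason=%s message=%q\n", c.Reason, c.Message)
}
```

With this shape, tools that match on the reason keep working, and users still see the additional information when they inspect the pod's conditions.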

@william-wang william-wang reopened this Dec 21, 2022
@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2022

7 participants