Add dynamic scheduling and rescheduling functionality based on the usage of node #1777

Closed
3 tasks done
william-wang opened this issue Oct 12, 2021 · 7 comments
Labels: area/scheduling, kind/feature, priority/important-longterm

@william-wang
Member

william-wang commented Oct 12, 2021

What would you like to be added:

  1. Dynamic scheduling based on the actual usage of a node rather than its allocation rate.
  2. Rescheduling pods from one node to another based on usage, so that no group of nodes ends up under- or over-utilized.

Why is this needed:

Currently, pods are scheduled based on their resource requests and the node's allocatable resources rather than on actual node usage. In some conditions this causes a high allocation rate together with a low utilization rate. As a result, some nodes are more likely to fail.
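
To make the gap between the two rates concrete, here is a minimal sketch in Python. It is not Volcano code: the `Node` fields, the node names, and the numbers are all hypothetical, and the usage value is assumed to come from some metrics source.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu: float  # cores the node can hand out
    requested_cpu: float    # sum of CPU requests of pods bound to the node
    used_cpu: float         # measured CPU usage (assumed to come from a metrics source)


def allocation_rate(node: Node) -> float:
    """Fraction of allocatable CPU reserved by pod requests."""
    return node.requested_cpu / node.allocatable_cpu


def utilization_rate(node: Node) -> float:
    """Fraction of allocatable CPU that is actually in use."""
    return node.used_cpu / node.allocatable_cpu


# Hypothetical cluster: node-a is heavily reserved but idle, node-b the opposite.
nodes = [
    Node("node-a", allocatable_cpu=16, requested_cpu=14, used_cpu=3),
    Node("node-b", allocatable_cpu=16, requested_cpu=6, used_cpu=15),
]

# A request-based choice picks the node with the lowest allocation rate,
# a usage-based choice picks the node with the lowest utilization rate.
by_request = min(nodes, key=allocation_rate)
by_usage = min(nodes, key=utilization_rate)
```

With these numbers the request-based choice lands on `node-b`, which is already the busiest node, while the usage-based choice lands on `node-a`.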

SubTasks

william-wang added the kind/feature label Oct 12, 2021
Thor-wl added the area/scheduling, help wanted, and priority/important-longterm labels Oct 12, 2021
@justadogistaken
Member

I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in some way?) will also bring some risk of resource contention.

@Thor-wl
Contributor

Thor-wl commented Nov 17, 2021

> I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in some way?) will also bring some risk of resource contention.

Yes, I agree with that. IMO, it is suitable for scenarios such as mixed deployment of online services and offline jobs. That means it is reasonable to evict some offline jobs when overselling turns out to be inaccurate.
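
As a rough illustration of that eviction idea, the sketch below picks offline pods to evict from an overloaded node until its projected CPU usage falls back to a target. Everything here is an assumption made for the example (the `Pod` fields, the `pick_evictions` helper, the largest-consumer-first order); it does not describe how Volcano selects victims.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    offline: bool    # True for batch/offline work that may be evicted
    used_cpu: float  # measured CPU usage in cores


def pick_evictions(pods: List[Pod], node_used_cpu: float, target_cpu: float) -> List[Pod]:
    """Choose offline pods to evict until projected usage is at or below target_cpu.

    The largest consumers go first so that fewer pods are disturbed.
    Online pods are never chosen.
    """
    victims: List[Pod] = []
    projected = node_used_cpu
    offline_pods = sorted(
        (p for p in pods if p.offline), key=lambda p: p.used_cpu, reverse=True
    )
    for pod in offline_pods:
        if projected <= target_cpu:
            break
        victims.append(pod)
        projected -= pod.used_cpu
    return victims


# Hypothetical node running one online service and three offline jobs.
pods = [
    Pod("web", offline=False, used_cpu=6.0),
    Pod("batch-1", offline=True, used_cpu=5.0),
    Pod("batch-2", offline=True, used_cpu=2.0),
    Pod("batch-3", offline=True, used_cpu=1.0),
]
victims = pick_evictions(pods, node_used_cpu=14.0, target_cpu=10.0)
```

In this example only `batch-1` is selected, since evicting it alone brings the projected usage from 14 to 9 cores.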

@k82cn
Member

k82cn commented Nov 17, 2021

There are two scenarios for William's proposal: 1. avoiding 100% CPU usage, and 2. oversubscription. For this issue, I prefer to handle the first scenario.
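
A minimal sketch of the first scenario, assuming per-node CPU utilization is available as a fraction between 0.0 and 1.0: nodes at or above a threshold are filtered out, and the least-loaded remaining node is chosen. The function names and the 0.8 default threshold are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional


def filter_hot_nodes(usage: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the names of nodes whose CPU utilization is below the threshold."""
    return sorted(name for name, value in usage.items() if value < threshold)


def pick_node(usage: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Pick the least-utilized node among those below the threshold, if any."""
    candidates = filter_hot_nodes(usage, threshold)
    if not candidates:
        return None  # every node is too hot; the pod stays pending
    return min(candidates, key=lambda name: usage[name])


# Hypothetical utilization snapshot.
usage = {"node-a": 0.95, "node-b": 0.40, "node-c": 0.75}
chosen = pick_node(usage)
```

Here `node-a` is excluded because it is above the threshold, and `node-b` is chosen as the least-loaded candidate.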

@justadogistaken
Member

> > I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in some way?) will also bring some risk of resource contention.
>
> Yes, I agree with that. IMO, it is suitable for scenarios such as mixed deployment of online services and offline jobs. That means it is reasonable to evict some offline jobs when overselling turns out to be inaccurate.

I think it's going to be a huge job, more than Volcano should or can do. Maybe Volcano can provide some basic abilities, like predicting node usage (system + online workloads), where the prediction algorithm could be the vpa-recommender, and predicting batch job completion time (from history records). I guess supporting such features could help improve system throughput and avoid resource contention as best as Volcano can.
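
To give a feel for what "predicting node usage" could look like at its simplest, here is a sketch that forecasts usage with an exponentially weighted moving average over past samples. This is a deliberately simple stand-in chosen for the example, not the vpa-recommender algorithm, and `ewma_forecast` and `has_headroom` are hypothetical names.

```python
from typing import Iterable


def ewma_forecast(samples: Iterable[float], alpha: float = 0.5) -> float:
    """Forecast the next usage value as an exponentially weighted moving average.

    alpha close to 1.0 follows recent samples closely; alpha close to 0.0 smooths more.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    iterator = iter(samples)
    try:
        estimate = next(iterator)
    except StopIteration:
        raise ValueError("at least one sample is required") from None
    for sample in iterator:
        estimate = alpha * sample + (1.0 - alpha) * estimate
    return estimate


def has_headroom(samples: Iterable[float], job_cpu: float, capacity_cpu: float) -> bool:
    """Check whether forecast usage plus the new job's CPU fits within capacity."""
    return ewma_forecast(samples) + job_cpu <= capacity_cpu


# Hypothetical CPU usage history of system + online workloads, in cores.
history = [1.0, 3.0, 5.0]
forecast = ewma_forecast(history)
```

A scheduler could compare such a forecast against node capacity before placing a batch job, as `has_headroom` does; a real predictor would also need to handle bursts and seasonality.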

@Thor-wl
Contributor

Thor-wl commented Nov 17, 2021

> And batch job completion time prediction (history records)

Yes, it's a good idea that was once mentioned in the weekly meeting. It would require a good AI model. According to earlier user interviews, some users have run tests: the approach works well in specific fields such as face recognition and network flow forecasting, while it is hard to get a good model for general, complex business workloads. So there is still no established practice for that.

Thor-wl self-assigned this Dec 7, 2021
william-wang added this to the future-release milestone Dec 17, 2021
Thor-wl removed the help wanted label Dec 23, 2021
william-wang self-assigned this Mar 30, 2022
@Thor-wl
Contributor

Thor-wl commented May 9, 2022

/close

@volcano-sh-bot
Contributor

@Thor-wl: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


5 participants