Add dynamic scheduling and rescheduling functionality based on the usage of node #1777

william-wang · 2021-10-12T01:36:00Z

What would you like to be added:

Dynamic scheduling based on actual node usage rather than the node allocation rate.
Rescheduling pods from one node to another based on usage, so that nodes do not end up under- or over-utilized.

Why is this needed:

Currently, pods are scheduled based on their resource requests and the node's allocatable resources rather than on actual node usage. Under some conditions this leads to a high allocation rate but a low utilization rate, and it makes some nodes more likely to fail.
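To make the gap between allocation rate and utilization rate concrete, here is a minimal sketch. It is not Volcano code: the `Node` fields, the two `pick_*` helpers, and the sample numbers are all hypothetical and only illustrate how the two views can disagree about which node is the better target.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu: float  # cores the node can hand out
    requested_cpu: float    # sum of pod CPU requests (what the scheduler sees today)
    used_cpu: float         # measured CPU usage (hypothetical metrics source)


def allocation_rate(node: Node) -> float:
    # Fraction of allocatable CPU claimed by pod requests.
    return node.requested_cpu / node.allocatable_cpu


def utilization_rate(node: Node) -> float:
    # Fraction of allocatable CPU actually in use.
    return node.used_cpu / node.allocatable_cpu


def pick_by_request(nodes):
    # Request-based choice: prefer the least-allocated node.
    return min(nodes, key=allocation_rate)


def pick_by_usage(nodes):
    # Usage-based choice: prefer the least-utilized node.
    return min(nodes, key=utilization_rate)


nodes = [
    Node("node-a", allocatable_cpu=8, requested_cpu=7, used_cpu=1.6),  # heavily allocated, mostly idle
    Node("node-b", allocatable_cpu=8, requested_cpu=3, used_cpu=6.4),  # lightly allocated, busy
]

print(pick_by_request(nodes).name)
print(pick_by_usage(nodes).name)
```

In this made-up example the request-based view sends new pods to the node that is already busy, which is the situation this issue wants to avoid.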

SubTasks

- Usage based scheduling
  - Support getting metrics from metrics server in cache and add usage based scheduling plugin
  - Design doc (PR: add design doc for usage based scheduling #2023)
  - Implementation (PR: add usage based scheduling plugin #2129)
- Rescheduling plugin
  - Rescheduling the pods on a node based on usage
  - Design doc (PR: add design doc for rescheduling #1923)
  - Implementation (PR: add rescheduling plugin #2098)
- Shuffle action
  - Add shuffle action for evicting tasks for TDM and rescheduling scenarios
  - Design doc (PR: add design doc for rescheduling #1923)
  - Implementation of shuffle action (PR: add shuffle action #2099)
  - Implementation of rescheduling plugin (PR: add shuffle action #2099)
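
As a rough illustration of the rescheduling idea in the subtasks above, the sketch below selects pods to evict from over-utilized nodes. It does not reproduce the design in #1923 or the shuffle action: the `Pod` and `Node` types, the `evictable` flag, the 0.8 threshold, and the "largest consumer first" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_usage: float        # measured CPU usage in cores (hypothetical)
    evictable: bool = True  # assumption: some pods, e.g. online services, must stay


@dataclass
class Node:
    name: str
    allocatable_cpu: float
    pods: list = field(default_factory=list)


def select_victims(nodes, high=0.8):
    """Return (node, pod) name pairs to evict so each node drops to `high` utilization or below."""
    victims = []
    for node in nodes:
        usage = sum(p.cpu_usage for p in node.pods)
        target = high * node.allocatable_cpu
        # Consider the largest consumers first so fewer pods are disturbed.
        for pod in sorted(node.pods, key=lambda p: p.cpu_usage, reverse=True):
            if usage <= target:
                break
            if not pod.evictable:
                continue
            victims.append((node.name, pod.name))
            usage -= pod.cpu_usage
    return victims


hot = Node("hot", 8, [Pod("online", 4.0, evictable=False), Pod("batch-1", 2.5), Pod("batch-2", 1.0)])
cold = Node("cold", 8, [Pod("batch-3", 1.0)])

print(select_victims([hot, cold]))
```

The evicted pods would then go back through normal scheduling, ideally with a usage-aware scoring step so they land on under-utilized nodes.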

justadogistaken · 2021-11-17T07:00:50Z

I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in a way?) will also bring some risk of resource contention.

Thor-wl · 2021-11-17T07:10:25Z

> I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in a way?) will also bring some risk of resource contention.

Yes, I agree with that. IMO, it is suitable for scenarios such as mixed deployment of online services and offline jobs. In that case it is reasonable to evict some offline jobs when the overselling turns out to be inaccurate.

k82cn · 2021-11-17T08:10:01Z

There are two scenarios for William's proposal: 1. avoiding 100% CPU usage, 2. oversubscription. For this issue, I'd prefer to handle the first scenario.
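
For the first scenario, a minimal sketch of a usage-based filter might look like the following. The function name, the input shape (node name mapped to a CPU utilization fraction), and the 0.8 default threshold are hypothetical and not taken from Volcano.

```python
def feasible_nodes(usage_by_node, cpu_threshold=0.8):
    """Keep only nodes whose measured CPU utilization is below the threshold.

    usage_by_node maps node name -> CPU utilization as a fraction (0.0 to 1.0).
    """
    return sorted(name for name, usage in usage_by_node.items() if usage < cpu_threshold)


print(feasible_nodes({"node-a": 0.35, "node-b": 0.97, "node-c": 0.79}))
```

Nodes that are already close to saturation are simply skipped, independent of how much of their capacity has been requested.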

justadogistaken · 2021-11-17T08:47:22Z

> > I guess the real-time usage of an application is dynamic. Scheduling jobs based on real-time usage will no doubt improve system throughput. But I guess such a scheduling strategy (overselling, in a way?) will also bring some risk of resource contention.
>
> Yes, I agree with that. IMO, it is suitable for scenarios such as mixed deployment of online services and offline jobs. In that case it is reasonable to evict some offline jobs when the overselling turns out to be inaccurate.

I think it's going to be a huge job, more than Volcano should or can do. Maybe Volcano can provide some basic capabilities, like predicting node usage (system + online workloads), where the prediction algorithm could be the vpa-recommender, and batch job completion time prediction (history records). I guess supporting such features could help improve system throughput and avoid resource contention as best Volcano can.

Thor-wl · 2021-11-17T09:34:13Z

> And batch job completion time prediction (history records)

Yes, it's a good idea that has come up in the weekly meeting before. It would need a good AI model to do that. According to earlier interviews, some users have run tests: it works well in specific fields such as face recognition and network flow forecasting, but it is hard to get a good model for general, complex business workloads. So there is still no established practice for it.

Thor-wl · 2022-05-09T02:10:35Z

/close

volcano-sh-bot · 2022-05-09T02:10:38Z

@Thor-wl: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

william-wang added the kind/feature label Oct 12, 2021

Thor-wl added the area/scheduling, help wanted, and priority/important-longterm labels Oct 12, 2021

Thor-wl self-assigned this Dec 7, 2021

william-wang added this to the future-release milestone Dec 17, 2021

Thor-wl mentioned this issue Dec 22, 2021

support descheduler for volcano #1917

Closed

Thor-wl removed the help wanted label Dec 23, 2021

Thor-wl mentioned this issue Dec 28, 2021

add design doc for rescheduling #1923

Merged

This was referenced Feb 16, 2022

Add dynamic scheduling support #2022

Closed

add design doc for usage based scheduling #2023

Merged

This was referenced Mar 16, 2022

[WIP] support rescheduling based on realtime performance #2092

Closed

add rescheduling plugin #2098

Closed

add shuffle action #2099

Closed

william-wang mentioned this issue Mar 29, 2022

add usage based scheduling plugin #2129

Merged

william-wang self-assigned this Mar 30, 2022

Thor-wl mentioned this issue Apr 21, 2022

add rescheduling plugin #2184

Merged

volcano-sh-bot closed this as completed May 9, 2022

william-wang modified the milestones: roadmap, v1.6 May 13, 2022

lowang-bh mentioned this issue Aug 2, 2023

volcano support descheduler plugin #3023

Closed
