Add dynamic scheduling and rescheduling functionality based on the usage of node #1777
Comments
The real-time resource usage of applications is dynamic, so scheduling jobs based on real-time usage will no doubt improve system throughput. But I think such a scheduling strategy (overselling, in some way?) also brings a risk of resource contention.
Yes, I agree with that. IMO, it is suitable for scenarios such as mixed deployment of online services and offline jobs. In that case it is reasonable to evict some offline jobs when the overselling estimate turns out to be inaccurate.
There are two scenarios for William's proposal: 1. avoiding 100% CPU usage, 2. oversubscription. For this issue, I prefer to handle the first scenario.
I think this is going to be a huge job, more than Volcano should or can do. Maybe Volcano can provide some basic capabilities, such as predicting node usage (system + online workloads), where the prediction algorithm could be the vpa-recommender, and predicting batch job completion time (from historical records). I think supporting such features could help improve system throughput and avoid resource contention as far as Volcano can.
Yes, it's a good idea that was mentioned in the weekly meeting. It would need a good AI model to do that. According to earlier user interviews, some users have run tests: it works well in specific fields such as face recognition and network flow forecasting, while it's hard to get a good model for general and complex business workloads. So there is still no established practice for that.
/close
@Thor-wl: Closing this issue.
What would you like to be added:
Why is this needed:
Currently, pods are scheduled based on their resource requests and the node's allocatable resources rather than on actual node usage. Under some conditions this leads to a high allocation rate but a low utilization rate. As a result, some nodes are more likely to fail.
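To make the gap between allocation rate and utilization rate concrete, here is a minimal Go sketch of usage-aware filtering and scoring. It is only an illustration under stated assumptions: the `Node` type, the helper functions, and the 0.8 threshold are all hypothetical and are not Volcano or Kubernetes APIs, and the usage numbers are assumed to come from some metrics source.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a node.
// It is NOT a Volcano or Kubernetes type.
type Node struct {
	Name           string
	AllocatableCPU float64 // cores the node can hand out
	RequestedCPU   float64 // sum of CPU requests of pods already placed
	UsedCPU        float64 // measured CPU usage (assumed to come from a metrics source)
}

// AllocationRate is the fraction of allocatable CPU that is already requested.
func AllocationRate(n Node) float64 { return n.RequestedCPU / n.AllocatableCPU }

// UtilizationRate is the fraction of allocatable CPU that is actually in use.
func UtilizationRate(n Node) float64 { return n.UsedCPU / n.AllocatableCPU }

// UsageFits rejects nodes whose measured utilization is at or above the threshold.
func UsageFits(n Node, threshold float64) bool {
	return UtilizationRate(n) < threshold
}

// UsageScore returns a score in [0, 100]; less utilized nodes score higher.
func UsageScore(n Node) float64 {
	u := UtilizationRate(n)
	if u < 0 {
		u = 0
	}
	if u > 1 {
		u = 1
	}
	return (1 - u) * 100
}

// PickNode returns the best-scoring node that passes the usage filter.
// The boolean is false when no node passes.
func PickNode(nodes []Node, threshold float64) (Node, bool) {
	var best Node
	found := false
	for _, n := range nodes {
		if !UsageFits(n, threshold) {
			continue
		}
		if !found || UsageScore(n) > UsageScore(best) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		// Heavily requested but mostly idle.
		{Name: "a", AllocatableCPU: 8, RequestedCPU: 7, UsedCPU: 2},
		// Lightly requested but already hot.
		{Name: "b", AllocatableCPU: 8, RequestedCPU: 3, UsedCPU: 7},
	}
	for _, n := range nodes {
		fmt.Printf("%s: allocation=%.2f utilization=%.2f\n",
			n.Name, AllocationRate(n), UtilizationRate(n))
	}
	if best, ok := PickNode(nodes, 0.8); ok {
		fmt.Println("usage-based choice:", best.Name)
	}
}
```

In this example a purely request-based view would see more free capacity on node `b`, while the usage-based filter rejects it as already hot and prefers node `a`. A real plugin would likely combine both signals rather than replace one with the other.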
SubTasks
Support getting metrics from the metrics server in the cache and add a usage-based scheduling plugin
  Design doc (PR: add design doc for usage based scheduling #2023)
  Implementation (PR: add usage based scheduling plugin #2129)
Re-scheduling the pods on a node based on usage
  Design doc (PR: add design doc for rescheduling #1923)
  Implementation (PR: add rescheduling plugin #2098)
Add a shuffle action for evicting tasks in TDM and rescheduling scenarios
  Design doc (PR: add design doc for rescheduling #1923)
  Implementation of the shuffle action (PR: add shuffle action #2099)
  Implementation of the rescheduling plugin (PR: add shuffle action #2099)
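The rescheduling and eviction subtasks above can be illustrated with a rough Go sketch of victim selection on an overloaded node. It follows the suggestion in the comments that offline jobs are the ones to evict, but everything here is an assumption for illustration: the `Pod` type, the `SelectVictims` function, the largest-consumer-first ordering, and the target utilization are hypothetical and do not describe the design in the linked PRs.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified view of a running pod.
// It is NOT a Volcano or Kubernetes type.
type Pod struct {
	Name    string
	Offline bool    // batch/offline workload; assumed to be safe to evict
	UsedCPU float64 // measured CPU usage in cores
}

// SelectVictims returns the offline pods to evict so that the node's
// projected CPU usage drops to targetUtil*allocatableCPU or below.
// Offline pods are taken largest consumer first; online pods are never chosen.
// If the offline pods cannot cover the excess, all of them are returned.
func SelectVictims(pods []Pod, nodeUsedCPU, allocatableCPU, targetUtil float64) []Pod {
	excess := nodeUsedCPU - targetUtil*allocatableCPU
	if excess <= 0 {
		return nil // node is already at or below the target
	}

	candidates := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if p.Offline {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].UsedCPU > candidates[j].UsedCPU
	})

	var victims []Pod
	for _, p := range candidates {
		if excess <= 0 {
			break
		}
		victims = append(victims, p)
		excess -= p.UsedCPU
	}
	return victims
}

func main() {
	pods := []Pod{
		{Name: "web", Offline: false, UsedCPU: 3},
		{Name: "batch-1", Offline: true, UsedCPU: 1},
		{Name: "batch-2", Offline: true, UsedCPU: 2},
		{Name: "batch-3", Offline: true, UsedCPU: 1.5},
	}
	// Node with 8 allocatable cores, 7.5 in use, and a 75% target utilization.
	for _, v := range SelectVictims(pods, 7.5, 8, 0.75) {
		fmt.Println("evict:", v.Name)
	}
}
```

Evicting the largest offline consumer first keeps the number of evictions small, at the cost of possibly freeing more CPU than strictly needed; a smallest-sufficient-set strategy would be an equally reasonable choice.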