Proposal: Volcano job support scale up and down #782
# Volcano Job scale up and down

@hzxuzhonghu; April 24, 2020

## Motivation

Currently, Volcano does not support Job updates: it is not allowed to change `Job.Spec` on the fly.
However, many users would like to run ML training jobs in an elastic manner. For example, ModelArts wants to dynamically adjust a Job's replicas according to the cluster's idle capacity
in order to achieve the highest utilization of GPU cards.

As a first step, before introducing more intelligent elasticity, I propose to support dynamic scale up/down of Volcano jobs.

## Design

Before going into the design, let's recall how a Job is currently initialized.

### Job Initialization

When a Volcano job is created, the job controller does the following to run and manage all of its tasks:

1. All plugins execute their `OnJobAdd` callbacks to create the service, the hosts ConfigMap, etc.

2. Create PVCs for the job.

3. Create a PodGroup for the job.

4. Execute the plugins' `OnPodCreate` callbacks to set pod-related environment variables, mount the hosts file, etc.

5. Call the kube-apiserver to create as many pods as the job's replicas.

All of the above steps run in `syncJob`, which is called when external events happen; here it happens when the Job is newly created.

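For orientation, here is a minimal, self-contained sketch of that ordering. The function names and the `job` struct are illustrative stand-ins, not the actual controller code.

```go
// Illustrative stand-ins only: the real controller operates on *vcbatch.Job
// and Kubernetes objects, but the ordering of the steps is the same.
package main

import "fmt"

type job struct {
	name     string
	replicas int
}

func pluginsOnJobAdd(j *job) error { fmt.Println("1. plugins OnJobAdd: service, hosts ConfigMap"); return nil }
func createPVCs(j *job) error      { fmt.Println("2. create PVCs"); return nil }
func createPodGroup(j *job) error  { fmt.Println("3. create PodGroup"); return nil }

// createPods runs the OnPodCreate callbacks and then creates one pod per replica.
func createPods(j *job) error {
	for i := 0; i < j.replicas; i++ {
		fmt.Printf("4-5. plugins OnPodCreate + create pod %s-%d\n", j.name, i)
	}
	return nil
}

// syncJob mirrors steps 1-5 above and stops at the first failing step.
func syncJob(j *job) error {
	for _, step := range []func(*job) error{pluginsOnJobAdd, createPVCs, createPodGroup, createPods} {
		if err := step(j); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = syncJob(&job{name: "demo", replicas: 3})
}
```
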
### Volcano Job Scale Up/Down

Scaling a Job up or down amounts to reconciling the resources the job owns, such as the PVC, the PodGroup, the Service, and the hosts-file ConfigMap,
so the procedure is quite similar to [Job Initialization](#job-initialization).

The differences are:

1. Job plugins' callbacks: only the `svc` plugin needs to update the ConfigMap that lists the job's tasks.

2. Create pods when scaling up.

3. Delete pods when scaling down.

> **Review comment:** Does a pod of a Volcano job have a corresponding headless service?
>
> **Reply:** Yes, it is similar to StatefulSet pods: we cannot delete the headless service when deleting the pods, but we do update the accessible hosts file.
>
> **Review comment:** Just curious, is there any reason the headless service cannot be deleted?
>
> **Reply:** Here we only scale down; some tasks may still exist, e.g. the `ps` tasks of a TensorFlow job. Besides, the headless service is deleted when the job completes or is deleted.

However, the initialization is run only when the job has not started yet.
So we need a way to know whether it was a scale up/down event that triggered this round of sync.

The way I propose is to add a new event, `JobUpdatedEvent`, to indicate that the job has been updated (here we only care about scale up/down).

> **Review comment:** When a Pod is created/deleted, how and when is the ConfigMap handled? It's better to highlight the time sequence for the user.
>
> **Reply:** I would add a new …
>
> **Review comment:** So we need to highlight when pods are created, when the ConfigMap is updated, and which pods will be deleted when scaling down.

Accordingly, a new action `UpdateJobAction` is added to run the `UpdateJob` function. The overall workflow is:

![workflow](images/Job-scale-up-down.PNG)

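As a rough illustration of when the new event would be raised, the sketch below uses simplified stand-in types rather than Volcano's actual API: the update handler compares the old and new specs and enqueues `JobUpdatedEvent` only for a replicas or `minAvailable` change.

```go
// Simplified stand-in types: the real handler compares two *vcbatch.Job objects.
package main

import "fmt"

type taskSpec struct {
	Name     string
	Replicas int32
}

type jobSpec struct {
	MinAvailable int32
	Tasks        []taskSpec
}

type event string

const jobUpdatedEvent event = "JobUpdated"

// replicasChanged reports whether any task's replica count differs between specs.
func replicasChanged(old, updated jobSpec) bool {
	if len(old.Tasks) != len(updated.Tasks) {
		return true
	}
	for i := range updated.Tasks {
		if updated.Tasks[i].Replicas != old.Tasks[i].Replicas {
			return true
		}
	}
	return false
}

// onJobUpdate is what the controller's update handler could do: only a scale
// up/down (or a minAvailable change) produces JobUpdatedEvent, which the
// state machine then maps to UpdateJobAction.
func onJobUpdate(old, updated jobSpec, enqueue func(event)) {
	if replicasChanged(old, updated) || old.MinAvailable != updated.MinAvailable {
		enqueue(jobUpdatedEvent)
	}
}

func main() {
	before := jobSpec{MinAvailable: 2, Tasks: []taskSpec{{Name: "worker", Replicas: 2}}}
	after := jobSpec{MinAvailable: 2, Tasks: []taskSpec{{Name: "worker", Replicas: 4}}}
	onJobUpdate(before, after, func(e event) { fmt.Println("enqueue:", e) })
}
```
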
To scale up/down on the fly, Volcano should be responsible for notifying the existing pods of the current status, including the hosts of all the pods.

> **Review comment:** I think we probably need more update categories here. I assume only specific changes will trigger …
>
> **Reply:** These kinds of updates are prohibited.

This is done by plugins, so to distinguish it from the initialization phase, a new `OnJobUpdate` callback is introduced.
It reconciles all of the job's associated configs. Currently, the `svc` plugin should update the ConfigMap holding all the hosts.

**NOTE**: Users should watch `/etc/volcano` to get the up-to-date hosts files if they want to be aware of the training workers.

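For instance, a training worker could periodically re-read the mounted hosts file to notice membership changes. The sketch below polls with the standard library only; the exact file name under `/etc/volcano` is an assumption for illustration.

```go
// Minimal polling sketch using only the standard library. The file name is
// hypothetical; check the actual files mounted under /etc/volcano for your
// job's tasks.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const hostsFile = "/etc/volcano/worker.host" // hypothetical task name "worker"

	var last string
	for range time.Tick(10 * time.Second) {
		data, err := os.ReadFile(hostsFile)
		if err != nil {
			fmt.Fprintln(os.Stderr, "read hosts file:", err)
			continue
		}
		if cur := string(data); cur != last {
			fmt.Print("worker list changed:\n" + cur)
			last = cur
		}
	}
}
```
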
With `OnJobUpdate` added, the plugin interface becomes:

```go
type PluginInterface interface {
	// Name returns the unique name of the plugin.
	Name() string

	// OnPodCreate is called for every pod in createJobPod.
	OnPodCreate(pod *v1.Pod, job *vcbatch.Job) error

	// OnJobAdd is called once in syncJob.
	OnJobAdd(job *vcbatch.Job) error

	// OnJobDelete is called once in killJob.
	OnJobDelete(job *vcbatch.Job) error

	// OnJobUpdate is called when the job is updated (scale up/down) to
	// reconcile the job's associated configs.
	OnJobUpdate(job *vcbatch.Job) error
}
```

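To make the `svc` plugin's responsibility concrete, here is a simplified sketch of rebuilding the per-task host lists from the current spec. The types, the `<job>-<task>-<index>` host naming, and the `<task>.host` keys are assumptions for illustration, not the plugin's real code.

```go
// Simplified sketch: recompute the hosts ConfigMap data from the current spec.
package main

import "fmt"

type task struct {
	Name     string
	Replicas int32
}

// buildHostsData rebuilds the <task>.host entries. Pod host names are assumed
// to follow the <job>-<task>-<index> convention of the job's headless service,
// so scaling only changes how many lines each task contributes.
func buildHostsData(jobName string, tasks []task) map[string]string {
	data := map[string]string{}
	for _, t := range tasks {
		hosts := ""
		for i := int32(0); i < t.Replicas; i++ {
			hosts += fmt.Sprintf("%s-%s-%d\n", jobName, t.Name, i)
		}
		data[t.Name+".host"] = hosts
	}
	return data
}

func main() {
	// After scaling workers from 2 to 4, OnJobUpdate would write data like
	// this back into the job's ConfigMap through the Kubernetes API.
	data := buildHostsData("demo", []task{{Name: "ps", Replicas: 1}, {Name: "worker", Replicas: 4}})
	fmt.Print(data["ps.host"], data["worker.host"])
}
```
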
`UpdateJob` is much like the current `SyncJob`, and its workflow is:

1. All plugins execute their `OnJobUpdate` callbacks to update all the envs, the service, and the hosts ConfigMap.

2. Create PVCs for the job if necessary.

3. Update the PodGroup for the job if necessary.

4. Execute the plugins' `OnPodCreate` callbacks to set pod-related environment variables, mount the hosts file, etc.

5. Call the kube-apiserver to create/delete pods so that the number of pods matches the job's replicas.

**Note**: when scaling down, pods are deleted from the highest index to the lowest. This order is not strictly guaranteed, though, since Kubernetes is an eventually consistent system.

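The create/delete step can be illustrated with a small, self-contained sketch (a hypothetical helper, not the controller's actual code) that computes which pod indices to create and which to delete, preferring the highest indices when scaling down.

```go
// Simplified sketch of deciding which pod indices to create or delete when a
// task is scaled. Deletion prefers the highest indices first, although the
// real system cannot strictly guarantee that order.
package main

import (
	"fmt"
	"sort"
)

// scalePlan returns the indices to create and to delete so that the set of
// existing pod indices matches [0, desired).
func scalePlan(existing []int, desired int) (create, remove []int) {
	have := map[int]bool{}
	for _, idx := range existing {
		have[idx] = true
	}
	for i := 0; i < desired; i++ {
		if !have[i] {
			create = append(create, i)
		}
	}
	for _, idx := range existing {
		if idx >= desired {
			remove = append(remove, idx)
		}
	}
	// Delete from the largest index down to the smallest.
	sort.Sort(sort.Reverse(sort.IntSlice(remove)))
	return create, remove
}

func main() {
	create, remove := scalePlan([]int{0, 1, 2, 3, 4}, 3)
	fmt.Println("create:", create, "delete:", remove) // create: [] delete: [4 3]
}
```
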
### Admission webhook

The admission webhook should prevent invalid mutation of the Job spec on the fly. In this proposal, only `replicas` and `minAvailable` updates are allowed; any other spec change is prohibited.
An update is also rejected if the total number of replicas is less than `minAvailable`.

`minAvailable` must be greater than zero; we depend on it to maintain the job status.

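A minimal sketch of these validation rules, using simplified stand-in types rather than the real `Job` API, could look like this:

```go
// Simplified sketch of validating a Job update under the rules above.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

type taskSpec struct {
	Name     string
	Replicas int32
	// Other task fields (template, policies, ...) are omitted here.
}

type jobSpec struct {
	MinAvailable int32
	Tasks        []taskSpec
}

// validateUpdate allows only replicas and minAvailable to change, requires
// minAvailable > 0, and rejects specs where total replicas < minAvailable.
func validateUpdate(old, updated jobSpec) error {
	if updated.MinAvailable <= 0 {
		return errors.New("minAvailable must be greater than zero")
	}
	var total int32
	for _, t := range updated.Tasks {
		total += t.Replicas
	}
	if total < updated.MinAvailable {
		return errors.New("total replicas must not be less than minAvailable")
	}
	// Everything except replicas and minAvailable must stay unchanged.
	normalize := func(s jobSpec) jobSpec {
		c := s
		c.MinAvailable = 0
		c.Tasks = append([]taskSpec(nil), s.Tasks...)
		for i := range c.Tasks {
			c.Tasks[i].Replicas = 0
		}
		return c
	}
	if !reflect.DeepEqual(normalize(old), normalize(updated)) {
		return errors.New("only replicas and minAvailable may be updated")
	}
	return nil
}

func main() {
	old := jobSpec{MinAvailable: 2, Tasks: []taskSpec{{Name: "worker", Replicas: 2}}}
	upd := jobSpec{MinAvailable: 2, Tasks: []taskSpec{{Name: "worker", Replicas: 4}}}
	fmt.Println(validateUpdate(old, upd)) // <nil>
}
```
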
> **Review comment:** I assume the scope of this proposal is to make sure the controller can respond to job scale up and down; no scale up/down decision needs to be made here.
>
> **Reply:** Yeah, correct.