Proposal: Volcano job support scale up and down #782

Merged on May 8, 2020 (4 commits)
Binary file added docs/design/images/Job-scale-up-down.PNG
63 changes: 63 additions & 0 deletions docs/design/job-scale-up-down.md
@@ -0,0 +1,63 @@
# Volcano Job scale up and down
Member

I assume the scope of this proposal is to make sure the controller can respond to job scale up and down; no scale up/down decision needs to be made here.

Collaborator Author

Yeah, correct.


@hzxuzhonghu; April 24, 2020

## Motivation

Currently, Volcano does not support Job updates: the `Job.Spec` cannot be changed on the fly.
However, users like ModelArts want to dynamically adjust a Job's replicas according to the cluster's idle resources
Member @k82cn, Apr 26, 2020

That's not only for ModelArts; AFAIK, several ML frameworks already support an elastic model, e.g. https://github.com/pytorch/elastic

Collaborator Author

Will read about that for more context

in order to achieve the highest possible efficiency on GPU cards.

## Design

Before describing this design, let's recall how a Job is currently initialized.

### Job Initialization

When a Volcano job is created, the job controller does the following to run/manage all of its tasks.

1. all the plugins execute their `OnJobAdd` callbacks

2. create the PVC for the job

3. create the PodGroup for the job

4. create as many pods as the job has replicas

All of the above steps run in `syncJob`, which is called when external events happen; in this case, the event is the creation of a new Job.
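
The ordering of the four steps above can be sketched roughly as follows. This is a simplified illustration, not the controller's actual code: the `Job` type, the `initSteps` function, and the `<job>-<index>` pod naming are all hypothetical, and the real `syncJob` talks to the Kubernetes API rather than returning strings.

```go
package main

import "fmt"

// Job is a hypothetical, stripped-down stand-in for a Volcano job.
type Job struct {
	Name     string
	Replicas int
}

// initSteps lists, in order, what the controller does for a newly created job.
// It only records the steps; no cluster resources are touched.
func initSteps(job Job) []string {
	steps := []string{
		"plugins: OnJobAdd", // 1. every plugin runs its OnJobAdd callback
		"create PVC",        // 2. volumes for the job
		"create PodGroup",   // 3. scheduling unit for the job
	}
	// 4. one pod per replica (assumed naming: <job>-<index>)
	for i := 0; i < job.Replicas; i++ {
		steps = append(steps, fmt.Sprintf("create pod %s-%d", job.Name, i))
	}
	return steps
}
```

Calling `initSteps(Job{Name: "demo", Replicas: 2})` yields the three setup steps followed by one entry per pod.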

### Volcano Job Scale Up/Down

Scaling a Job up or down amounts to reconciling the resources the job owns, such as the PVC, PodGroup, Service, and HostFile ConfigMap,
so the procedure is similar to [Job Initialization](#job-initialization).

The differences are:

1. job plugins' callbacks: only the `svc` plugin needs to update the ConfigMap that lists the job's tasks

2. create pods when scaling up

3. delete pods when scaling down
Member

Does each pod of a Volcano job have a corresponding headless service?

Collaborator Author

Yes, it is similar to StatefulSet pods: we cannot delete the headless service when deleting the pods, but we will update the accessible hostfile.

Member

Just curious: is there any reason the headless service cannot be deleted?

Collaborator Author

Here we only scale down; some tasks may still exist, for example the ps tasks of a TensorFlow job. BTW, the headless service is deleted when the job completes or is deleted.
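
The pod-level part of the scale up/down differences listed above (create pods on scale up, delete pods on scale down) can be illustrated with a small diff between the pods that exist and the pods the new replica count calls for. This is only a sketch under stated assumptions: `planScale` and `podName` are hypothetical helpers, pods are assumed to be named `<job>-<index>`, and removing the highest indices first (as a StatefulSet does) is an assumption, since the proposal does not say which pods are deleted.

```go
package main

import (
	"fmt"
	"sort"
)

// podName builds the assumed pod name for a given replica index.
func podName(job string, index int) string {
	return fmt.Sprintf("%s-%d", job, index)
}

// planScale compares the existing pod names with the desired replica count
// and returns the pods to create and the pods to delete.
func planScale(job string, existing []string, desired int) (toCreate, toDelete []string) {
	want := make(map[string]bool)
	for i := 0; i < desired; i++ {
		want[podName(job, i)] = true
	}

	have := make(map[string]bool)
	for _, name := range existing {
		have[name] = true
		if !want[name] {
			// Not part of the desired set: indices >= desired end up here.
			toDelete = append(toDelete, name)
		}
	}

	for i := 0; i < desired; i++ {
		if name := podName(job, i); !have[name] {
			toCreate = append(toCreate, name)
		}
	}

	sort.Strings(toDelete) // deterministic order for the caller
	return toCreate, toDelete
}
```

With existing pods `job-0`, `job-1`, `job-2`, a desired count of 2 plans the deletion of `job-2`, and a desired count of 4 plans the creation of `job-3`. The headless service is left alone in both cases; only the hostfile contents would change.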


However, the initialization runs only when the job has not yet started.
So we need a way to know whether a scale up/down event triggered this round of sync.

I propose adding a new event, `JobUpdatedEvent`, to indicate that the job has been updated (here we only care about scale up/down).
Member

When a Pod is created/deleted, how and when is the ConfigMap handled? It would be better to highlight the time sequence for the user.

Collaborator Author

I would add a new OnJobUpdate method to the plugin

Member

> I would add a new OnJobUpdate method to the plugin

So?

We need to highlight when pods are created, when the ConfigMap is updated, and which pods will be deleted on scale down.

Accordingly, add a new action, `UpdateJobAction`, that runs the `UpdateJob` function. The overall workflow is:
![workflow](images/Job-scale-up-down.PNG)
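
As a rough illustration of the event-to-action routing described above, the sketch below maps `JobUpdatedEvent` to `UpdateJobAction` and sends everything else to the regular sync path. The `Event` and `Action` types, the string values, the `OutOfSyncEvent` and `SyncJobAction` names used for the fallback path, and the `actionFor` function are assumptions made for this example; only `JobUpdatedEvent` and `UpdateJobAction` come from the proposal.

```go
package main

// Event and Action are simplified stand-ins for the controller's types.
type Event string
type Action string

const (
	// Proposed in this design.
	JobUpdatedEvent Event  = "JobUpdated"
	UpdateJobAction Action = "UpdateJob"

	// Assumed names for the pre-existing default path.
	OutOfSyncEvent Event  = "OutOfSync"
	SyncJobAction  Action = "SyncJob"
)

// actionFor picks the action to run for an incoming event.
// Only a job update (scale up/down) is routed to UpdateJob.
func actionFor(e Event) Action {
	if e == JobUpdatedEvent {
		return UpdateJobAction
	}
	return SyncJobAction
}
```

Keeping the update path behind its own event means the initialization logic does not need to guess whether a sync was caused by a replica change.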


### Admission webhook

The admission webhook should prevent invalid mutations of the Job spec on the fly. In this proposal, only `replicas` updates are allowed; any other spec change is prohibited.
Member

What's our expected behaviour if minMember and replicas do not match?

Collaborator Author

That is an invalid update; the API call will fail.

Member

Let's document it, if that's the case.
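
The admission rule discussed in this thread could look roughly like the sketch below: reject any change other than `replicas`, and reject a replica count that no longer matches the minimum member requirement. The `JobSpec` and `TaskSpec` types are hypothetical and heavily reduced, `validateUpdate` is not the webhook's real function, and reading "does not match" as "total replicas fall below `MinAvailable`" is an assumption, since the exact rule is not spelled out in the proposal.

```go
package main

import "errors"

// TaskSpec is a reduced, hypothetical task definition.
type TaskSpec struct {
	Name     string
	Replicas int32
	Image    string // stands in for the rest of the pod template
}

// JobSpec is a reduced, hypothetical job spec.
type JobSpec struct {
	MinAvailable int32
	Queue        string
	Tasks        []TaskSpec
}

// validateUpdate allows only replica changes between old and updated.
func validateUpdate(old, updated JobSpec) error {
	if old.MinAvailable != updated.MinAvailable || old.Queue != updated.Queue {
		return errors.New("only replicas may be updated")
	}
	if len(old.Tasks) != len(updated.Tasks) {
		return errors.New("tasks may not be added or removed")
	}

	var total int32
	for i, t := range updated.Tasks {
		o := old.Tasks[i]
		if t.Name != o.Name || t.Image != o.Image {
			return errors.New("only replicas may be updated")
		}
		if t.Replicas < 0 {
			return errors.New("replicas must not be negative")
		}
		total += t.Replicas
	}

	// Assumed rule: the job must still be able to satisfy MinAvailable.
	if total < updated.MinAvailable {
		return errors.New("total replicas are below minAvailable")
	}
	return nil
}
```

Under these assumptions, scaling a three-replica task to five passes, while scaling it to one with `MinAvailable` set to 2 fails, as does changing the queue.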