[FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob #949
Comments
@zhan849 Can you explain more about the support required and the changes to the API?
@johnugeorge I'm thinking about the following:
VolumeClaimTemplates []corev1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
I can send out the PR for the community to review if the general direction is agreed on.
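A minimal sketch of how the proposed field could sit on a job spec and serialize, assuming a trimmed stand-in type instead of the real `corev1.PersistentVolumeClaim` so that it compiles without `k8s.io/api`. `JobSpec`, `PersistentVolumeClaim` (as defined here), `MarshalSpec`, and the `scratch`/`gp2`/`200Gi` values are all hypothetical and not taken from the operator code base:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PersistentVolumeClaim is a hypothetical, heavily trimmed stand-in for
// corev1.PersistentVolumeClaim, used only to keep this sketch self-contained.
type PersistentVolumeClaim struct {
	Name             string `json:"name"`
	StorageClassName string `json:"storageClassName,omitempty"`
	Storage          string `json:"storage"`
}

// JobSpec is a hypothetical stand-in for TFJobSpec / PyTorchJobSpec.
type JobSpec struct {
	// VolumeClaimTemplates mirrors the field proposed in the thread.
	// With omitempty, existing specs that do not set it serialize unchanged.
	VolumeClaimTemplates []PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
}

// MarshalSpec serializes a spec to a JSON string.
func MarshalSpec(s JobSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A spec without templates stays empty on the wire.
	fmt.Println(MarshalSpec(JobSpec{}))

	// A spec asking for one scratch volume per replica (assumed values).
	fmt.Println(MarshalSpec(JobSpec{
		VolumeClaimTemplates: []PersistentVolumeClaim{
			{Name: "scratch", StorageClassName: "gp2", Storage: "200Gi"},
		},
	}))
}
```

Because the field is tagged `omitempty`, adding it would presumably be backward compatible for jobs that do not use it.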
/cc @richardsliu
The idea SGTM, but I think storage classes already support dynamic volume provisioning. What would we need to do for this use case?
@gaocegege Yes, the storage class should define how the volume is actually provisioned. We have already started some experiments in a forked branch. Do you want me to submit a brief proposal, or do you want to implement it yourselves? It would be something similar to StatefulSet.
@zhan849 I am glad to see your proposal! Thanks for your contribution.
Should we place the feature in common-operator? It is general to all PS-worker-based training jobs.
@gaocegege Yes, I think so. So it should be in the next API version.
@gaocegege @richardsliu Agreed. I will close this one and open a new one under common-operator. As for the design doc, given all the discussion threads under tf-operator, I'd suggest that we keep it there and check it into the community repo after the PR is finalized and merged.
There are many machine learning use cases where jobs need large scratch spaces, e.g. a job needs to download hundreds of GBs of data for processing. In a cloud environment, block devices (EBS, for example) have advantages: we don't need to over-provision the hosts with large host volumes, nor do we need an expensive shared file system (EFS, for example), since machine learning workloads usually don't need to share local data.
Given such use cases, persistent volumes would be a good and efficient way to support them in Kubernetes. Kubeflow currently requires users to provision persistent volume claims separately, as shown in https://github.com/kubeflow/tf-operator/tree/631dd0e31b8bfbb59b2b6ab7a3ea501cb289d479/examples/v1beta1/mnist_with_summaries, which adds operational overhead.
I was wondering if we can add support for dynamic volume provisioning in TFJobSpec and PyTorchJobSpec. A rough thought would be something similar to that of StatefulSet.
We are happy to contribute and send out a PR implementing the feature.
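To illustrate what "similar to StatefulSet" could mean in practice: a StatefulSet creates one claim per pod, named `<template name>-<pod name>`. The sketch below applies that naming convention to job replicas. The function `claimNames` is hypothetical, and the pod naming pattern `<job>-<replica type>-<index>` is an assumption made for illustration, not confirmed behavior of the operators:

```go
package main

import "fmt"

// claimNames expands each claim template into one claim name per replica,
// following the StatefulSet convention "<template>-<pod>".
// The pod name format "<job>-<replicaType>-<index>" is an assumption.
func claimNames(templates []string, jobName, replicaType string, replicas int) []string {
	if replicas <= 0 {
		return nil
	}
	names := make([]string, 0, len(templates)*replicas)
	for i := 0; i < replicas; i++ {
		pod := fmt.Sprintf("%s-%s-%d", jobName, replicaType, i)
		for _, t := range templates {
			names = append(names, fmt.Sprintf("%s-%s", t, pod))
		}
	}
	return names
}

func main() {
	// Two workers, each with its own "scratch" claim.
	for _, n := range claimNames([]string{"scratch"}, "mnist", "worker", 2) {
		fmt.Println(n)
	}
}
```

Deterministic per-replica names would let a controller look up an existing claim before creating one, so a restarted replica could reattach to the same volume.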