[discussion] specify total GPU count for distributed training #384
Thanks for your issue. I am not sure if I understand the idea. Do you mean that the operator should support assigning GPUs to PS and workers automatically?
Kubernetes can already assign GPUs to a pod/worker automatically if the "nvidia.com/gpu" limit is specified in the pod YAML file. But for distributed training this is neither easy nor user-friendly: we need to create each worker pod and set its GPU count separately. What I mean is that, for distributed training, it should be easier for the user to create the job.
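For reference, this is roughly what "set the GPU count in each pod separately" means in practice. The sketch below uses the Kubernetes Python client instead of a raw pod YAML file; the pod name, image, and namespace are placeholders, and the only essential part is the "nvidia.com/gpu" entry under resources.limits.

```python
# Minimal sketch (not from this thread): create one worker pod with a GPU limit.
from kubernetes import client, config

config.load_kube_config()

worker = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="trainer-worker-0"),  # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="tensorflow",
                image="tensorflow/tensorflow:latest-gpu",  # placeholder image
                # The scheduler will place this pod on a node with 2 free GPUs.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"}
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=worker)
```

For a distributed job, the user has to repeat this (or the equivalent YAML) once per worker, choosing each pod's GPU count by hand, which is the pain point described above.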
I do not think we should hide the PS/worker split from users at the operator level. Maybe we could build a config generator on top of tf-operator which accepts the user code and the number of GPUs and generates the TFJob config.
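Taking the generator idea literally, here is a rough sketch of what such a tool might do; it is an assumption, not an existing tool. Because all workers in one TFJob share the same pod template, this version simply rounds the request up to whole workers of `gpus_per_worker` GPUs each. The field names follow the current kubeflow.org/v1 TFJob schema and may differ in older tf-operator releases.

```python
import math
import yaml


def generate_tfjob(name, image, total_gpus, gpus_per_worker=8, num_ps=2):
    """Build a TFJob manifest that requests roughly `total_gpus` GPUs."""
    num_workers = math.ceil(total_gpus / gpus_per_worker)

    def replica_spec(count, gpu_limit=None):
        container = {"name": "tensorflow", "image": image}
        if gpu_limit:
            container["resources"] = {"limits": {"nvidia.com/gpu": gpu_limit}}
        return {"replicas": count,
                "template": {"spec": {"containers": [container]}}}

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(num_ps),
                "Worker": replica_spec(num_workers, gpus_per_worker),
            }
        },
    }


# Example: 16 GPUs become 2 workers with 8 GPUs each, plus 2 CPU-only PS pods.
print(yaml.safe_dump(generate_tfjob("dist-train", "tensorflow/tensorflow:latest-gpu", 16)))
```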
I agree we can implement this on top of tf-operator.
Horovod looks promising. Is there something that could make it easier to use on K8s?
I'm going to close this issue out because of lack of activity.
I am not sure whether this is the right place to discuss this.
Suppose we have a k8s cluster with 5 nodes, each with 8 GPUs, so there are 40 GPUs in total,
and a user starts a distributed training job with 20 GPUs.
What we expect:
the user just specifies the number 20 and does not need to split the GPU request across pods manually; a controller or something similar can do this automatically according to the cluster's current free GPU resources,
e.g. 20 = 8 + 8 + 2 + 2 (a rough sketch of such a split follows below).
At the same time, when the training ends, all pods can be deleted by this controller.
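For illustration only, here is a sketch of the splitting step described above, assuming the controller can see the free GPU count on each node; the [8, 8, 2, 2, 0] free-capacity figures are made up to match the 20 = 8 + 8 + 2 + 2 example.

```python
def split_gpu_request(total, free_gpus_per_node):
    """Greedily split `total` GPUs over nodes, e.g. 20 -> [8, 8, 2, 2]."""
    allocation = []
    remaining = total
    # Fill the emptiest-first? Here: largest free capacity first, to keep
    # the number of workers small.
    for free in sorted(free_gpus_per_node, reverse=True):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            allocation.append(take)
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free GPUs in the cluster")
    return allocation


# Cluster from the example: 5 nodes of 8 GPUs, but only [8, 8, 2, 2, 0] free.
print(split_gpu_request(20, [8, 8, 2, 2, 0]))  # -> [8, 8, 2, 2]
```

The controller would then create one worker pod per entry in the returned list and delete them all when the training finishes.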
Does tensorflow/k8s or another operator have this function?