[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
Comments
@jlewi PTAL
I think we could discuss the differences together; some of them are not necessary for upstream and we could remove them. I am now trying to refactor the code, and there are some things I could do right away: #312. But most of the refactoring work is blocked by this issue. 😄 Looking forward to your reply.
**TfImage** TfImage is used as the default TensorFlow image for parameter servers and TensorBoard. If we get rid of it, what image would we use for TensorBoard and parameter servers?

**TerminationPolicy** TerminationPolicy as currently implemented is a property of the job and describes when the job should be considered terminated. So why would you add it to the replica?

**Dirs** Per #224, let's add these at higher layers, not in TfJobSpec.

**Type** I think we should get rid of type and instead add properties that control the behavior of each replica, e.g. restart behavior. Each replica can then be given a name by the user.

**IsDefaultPs** Are you suggesting we get rid of support for standard gRPC servers for TF, or keep it?

**TfPort** You raise a good point about this being redundant with the ports in PodTemplateSpec. I think it makes sense to get rid of TfPort and just allow it to be specified in PodTemplateSpec. We could require that either …

**TfJobStatus** Regarding having a map from "job_name + task_index" to status (submitted, created, failed): won't this get pretty verbose for really large jobs? Why wouldn't we just emit events for each "job_name + task_index"? What's the purpose of tracking individual replica state in the TfJobStatus? It looks like StatefulSet only tracks aggregate statistics like readyReplicas.

**When to create CRD** See #281.
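As a rough illustration of the "emit events instead of tracking every replica in the status" idea above, a sketch using client-go's event recorder could look like the following; the function name and event reason are invented for this example and are not part of either controller:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// reportReplicaPhase records one Event per "job_name + task_index" transition
// against the owning TFJob object instead of growing TfJobStatus, so very
// large jobs do not produce a huge per-replica status map.
func reportReplicaPhase(rec record.EventRecorder, job runtime.Object, jobName string, taskIndex int, phase string) {
	rec.Eventf(job, corev1.EventTypeNormal, "ReplicaPhaseChanged",
		"%s, task %d is now %s", jobName, taskIndex, phase)
}
```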
@gaocegege Thanks for sorting out the differences.

**TfImage** I prefer to get rid of the TfImage field. As for TensorBoard, it does not belong in the TFJob scope, so we just need to focus on PS. There are two alternative ways, as below:

**TerminationPolicy** @gaocegege I remember that we discussed this field offline before. Is it for fine-grained control and a termination policy at the replica level?

**Dirs** We have implemented and wrapped them in a higher-level component internally, so please ignore them.

**Type** It's helpful for users to set the replica names. IMO, an explicit …

**IsDefaultPs** Outside Google, it's not a common case to provide a default PS. We'd better get rid of it and focus on TensorFlow distributed training job lifecycle management.

**TfPort** Vote for …

**TfJobStatus** For more complicated cases, like model parallelism, there is different computation work for each replica, so we have to get the status and index of each of them. Even in the data parallelism case, when it comes to policy (restart, restart and restore, don't restart), we have to know the index of the replicas. If the chief worker fails, the whole distributed job may fail; if a non-chief worker fails, we may apply a different policy.

**When to create CRD** Agree. We can discuss it in #281.

**CRD Name** I noticed that there is a difference in CRD name. We prefer …
@gaocegege @DjangoPeng Could we split this up into multiple issues for the unresolved items? For me at least it's easier to discuss a single issue per thread rather than a whole bunch of issues. Most of what you're proposing seems good to me, but it will be easier to track and resolve as separate issues.

**TfImage** I'd be OK getting rid of TfImage, and perhaps as a result TB and DefaultPs (since they depend on it). Running TB was in many ways a hack, and there are probably better solutions for running TB.

**IsDefaultPs** I'd be OK getting rid of this and considering alternative suggestions, e.g. providing a Docker image that could be used as a TF server and maybe some ksonnet templates.

**Type** I think this bears more discussion and relates to how we think about TerminationPolicy, e.g. should we derive TerminationPolicy for the replica from its type, or should we allow TerminationPolicy to be set explicitly (in which case maybe we don't need type)?

**TfPort** Seems reasonable. I can't think of a reason why users would want to specify a port.

**TfJobStatus** I think determining the right API for Status deserves more thought and its own issue. What we have right now is very confusing.

**CRD Name** My vote would be to follow the Kubernetes conventions.
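To make the Type vs. explicit TerminationPolicy question concrete, here is a hypothetical sketch of a replica spec; this is not the actual API of either repository, and every field name below is illustrative only:

```go
package controller

import corev1 "k8s.io/api/core/v1"

// HypotheticalTFReplicaSpec sketches the "no Type, explicit behavior" option
// discussed above; none of these field names are final.
type HypotheticalTFReplicaSpec struct {
	// Name is chosen by the user instead of a fixed master/worker/PS enum.
	Name     string                 `json:"name"`
	Replicas *int32                 `json:"replicas,omitempty"`
	Template corev1.PodTemplateSpec `json:"template"`

	// RestartPolicy and IsChief make explicit the behavior that Type used to
	// imply; the job-level termination rule can then be derived from which
	// replica is marked as chief.
	RestartPolicy corev1.RestartPolicy `json:"restartPolicy,omitempty"`
	IsChief       bool                 `json:"isChief,omitempty"`
}
```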
I have opened four issues to track the clear and definitive parts. A checklist has been added at the top of this issue. PTAL @jlewi

@ScorpioCPH Could you please open a dedicated issue for TfJobStatus?

FYI #333.

@ScorpioCPH I have added it to the checklist.

I think I've commented on all the issues.

Closed by #492
Checklist

To track all the issues we discussed above, here is a checklist:
We plan to contribute our controller caicloud/kubeflow-controller to tensorflow/k8s. To that end, we have summarized the differences between tensorflow/k8s and caicloud/kubeflow-controller. Comments are welcome to help us resolve these conflicts :-)

Version: 2 (every time I update the content I will bump the version number here)
CRD

TFJobSpec

TensorBoard support (TBD)

#209 (comment)

caicloud/kubeflow-controller's CRD only supports training, while the `TFJobSpec` in tensorflow/k8s still has `TensorBoard`. I think we could remove the field if we do not support TensorBoard in the CRD.

TfImage (TBD)
We think the image is not necessary, since the TFReplica has a pod template, which contains an `image` field.
TerminationPolicy (TBD)

We placed the policy in `TFReplicaSpec`, not in `TFJobSpec`, although we haven't implemented the logic to finish the TFJob according to the condition defined in `TerminationPolicySpec`. Personally, I think we should discuss it.

Dir (TBD)
There is an issue about it: #224. Personally, I think that if this controller is not used directly by AI engineers, there is no need to add these dirs, since they could be added in `TFReplicaSpec`'s pod template; we all know how to write a pod specification.

TFReplicaSpec
Type (TBD)

We have three types: `local`, `worker`, and `PS`. tensorflow/k8s has `master`, `worker`, and `PS`.

IsDefaultPS (TBD)
We do not have a pre-defined `grpc_tensorflow_server` in the repository, so we do not have this field.
TfPort (TBD)

The container spec in the pod template already has a `Port` field, so I am not sure if we need this field.
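As a concrete illustration of the TfImage and TfPort points, both values can already live in the replica's pod template; a minimal sketch, with an example image tag and port number chosen only for illustration:

```go
package controller

import corev1 "k8s.io/api/core/v1"

// examplePodTemplate shows a replica pod template that already carries the
// image and port that TfImage/TfPort would otherwise duplicate.
var examplePodTemplate = corev1.PodTemplateSpec{
	Spec: corev1.PodSpec{
		Containers: []corev1.Container{{
			Name:  "tensorflow",
			Image: "tensorflow/tensorflow:1.4.0", // would replace TfImage
			Ports: []corev1.ContainerPort{{
				Name:          "tfjob-port",
				ContainerPort: 2222, // would replace TfPort
			}},
		}},
	},
}
```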
TfJobStatus

TFReplicasStates (TBD)

Now we keep consistent with tensorflow/k8s, but there is one comment suggesting a new field to replace `TFReplicasStates`: https://github.com/caicloud/kubeflow-controller/issues/80#issuecomment-356509119. We think `TFReplicasStates` misses a little info, and we should keep all statuses of the instances.
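One possible shape for "keep all statuses of the instances" is sketched below; this is only a discussion aid, not what either repository implements, and all type and field names are hypothetical:

```go
package controller

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HypotheticalReplicaStatus keeps per-instance state, which aggregate
// counters such as TFReplicasStates cannot express.
type HypotheticalReplicaStatus struct {
	JobName            string      `json:"jobName"`   // e.g. "worker"
	TaskIndex          int32       `json:"taskIndex"` // e.g. 3
	Phase              string      `json:"phase"`     // Submitted / Created / Failed / ...
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
}

// HypotheticalTFJobStatus would then hold one entry per replica instance.
type HypotheticalTFJobStatus struct {
	ReplicaStatuses []HypotheticalReplicaStatus `json:"replicaStatuses,omitempty"`
}
```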
Controller
When to create CRD (TBD)
There is a discussion about this in #281: tensorflow/k8s creates the CRD when the controller is initialized, while our controller assumes that the CRD has been created before the controller is run.
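For reference, "create the CRD when the controller is initialized" usually reduces to an ensure-exists call like the sketch below (assuming a recent apiextensions clientset; the helper name is made up), while the alternative is simply to apply the CRD manifest out of band before starting the controller:

```go
package controller

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ensureCRD creates the TFJob CRD if it is not already present, which is what
// "create the CRD on controller start" amounts to. If the CRD is managed out
// of band (e.g. kubectl apply), this step is skipped entirely.
func ensureCRD(ctx context.Context, cs apiextensionsclient.Interface, crd *apiextensionsv1.CustomResourceDefinition) error {
	_, err := cs.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return err
}
```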
How to manage replicas (Solved)
There have been lots of discussions: #45 and https://github.com/caicloud/kubeflow-controller/issues/71. Now we have come to an agreement:

Pod is more appropriate, so we can reuse the code in caicloud/kubeflow-controller.
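Managing replicas as plain Pods roughly means the controller stamps Pods out of the replica's pod template and ties them to the TFJob with an owner reference; a minimal sketch, assuming modern client-go and using invented helper and naming conventions:

```go
package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createReplicaPod creates one Pod for a replica ("<job>-<replica>-<index>"
// style naming) directly from the replica's pod template, owned by the TFJob
// so that garbage collection cleans it up when the TFJob is deleted.
func createReplicaPod(ctx context.Context, kc kubernetes.Interface, ns, jobName, replicaName string,
	index int, template corev1.PodTemplateSpec, owner metav1.OwnerReference) (*corev1.Pod, error) {

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:            fmt.Sprintf("%s-%s-%d", jobName, replicaName, index),
			Labels:          template.Labels,
			OwnerReferences: []metav1.OwnerReference{owner},
		},
		Spec: template.Spec,
	}
	return kc.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
}
```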
Operator or controller (Solved)
According to #206, we decided to move to the controller pattern.